Bug 1329472

Summary: Cannot recreate remote node resource
Product: Red Hat Enterprise Linux 7
Reporter: Andrew Beekhof <abeekhof>
Component: pcs
Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Priority: low
Version: 7.2
CC: abeekhof, cfeist, cluster-maint, idevat, omular, rmarigny, rsteiger, tojeline
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pcs-0.9.152-5.el7
Doc Type: Bug Fix
Doc Text:
Cause: User removes a remote node from a cluster.
Consequence: Pcs does not tell pacemaker the node is permanently gone and should be removed from pacemaker's internal structures. Pcs then refuses to create a resource or a remote node with the same name, saying it already exists.
Fix: Tell pacemaker the node was removed from the cluster.
Result: It is possible to recreate the resource / remote node.
Last Closed: 2016-11-03 20:58:40 UTC
Type: Bug
Bug Depends On: 1303136    
Attachments:
  proposed fix

Description Andrew Beekhof 2016-04-22 03:07:34 UTC
Description of problem:

Cannot recreate a remote node resource when there is an existing node entry left over from a previous incarnation of the node.

[root@overcloud-controller-0 heat-admin]# pcs resource disable overcloud-novacompute-2
[root@overcloud-controller-0 heat-admin]# pcs resource delete overcloud-novacompute-2
Attempting to stop: overcloud-novacompute-2...Stopped
[root@overcloud-controller-0 heat-admin]# pcs resource create overcloud-novacompute-2 remote reconnect_interval=240
Error: unable to create resource/fence device 'overcloud-novacompute-2', 'overcloud-novacompute-2' already exists on this system
[root@overcloud-controller-0 heat-admin]# cibadmin -Ql | grep -C 10 overcloud-novacompute-2
      <node id="overcloud-novacompute-0" type="remote" uname="overcloud-novacompute-0">
        <instance_attributes id="nodes-overcloud-novacompute-0">
          <nvpair id="nodes-overcloud-novacompute-0-osprole" name="osprole" value="compute"/>
        </instance_attributes>
      </node>
      <node id="overcloud-novacompute-1" type="remote" uname="overcloud-novacompute-1">
        <instance_attributes id="nodes-overcloud-novacompute-1">
          <nvpair id="nodes-overcloud-novacompute-1-osprole" name="osprole" value="compute"/>
        </instance_attributes>
      </node>
      <node type="remote" id="overcloud-novacompute-2" uname="overcloud-novacompute-2">
        <instance_attributes id="nodes-overcloud-novacompute-2">
          <nvpair id="nodes-overcloud-novacompute-2-osprole" name="osprole" value="compute"/>
        </instance_attributes>
      </node>
    </nodes>
    <resources>
      <primitive class="ocf" id="ip-192.0.2.6" provider="heartbeat" type="IPaddr2">
        <instance_attributes id="ip-192.0.2.6-instance_attributes">
          <nvpair id="ip-192.0.2.6-instance_attributes-ip" name="ip" value="192.0.2.6"/>
          <nvpair id="ip-192.0.2.6-instance_attributes-cidr_netmask" name="cidr_netmask" value="32"/>
        </instance_attributes>
        <operations>


Version-Release number of selected component (if applicable):

pcs-0.9.143-15.el7.x86_64

How reproducible:

100%

Steps to Reproduce:
1. See the transcript above: disable and delete the remote node resource, then attempt to recreate it.

Actual results:

Error: unable to create resource/fence device 'overcloud-novacompute-2', 'overcloud-novacompute-2' already exists on this system


Expected results:

The resource is created.

Additional info:

I think pcs' uniqueness checks are being slightly overzealous here.
The operation should be allowed to proceed.

Comment 1 Andrew Beekhof 2016-04-22 03:12:55 UTC
Weirder... it seems to be happening at the CIB level:

[root@overcloud-controller-0 heat-admin]# cibadmin --create -o resources --xml-text "     <primitive class="ocf" id="overcloud-novacompute-2.localdomain" provider="pacemaker" type="remote">
>         <instance_attributes id="overcloud-novacompute-2-instance_attributes">
>           <nvpair id="overcloud-novacompute-2-instance_attributes-reconnect_interval" name="reconnect_interval" value="240"/>
>         </instance_attributes>
>         <operations>
>           <op id="overcloud-novacompute-2-start-interval-0s" interval="0s" name="start" timeout="60"/>
>           <op id="overcloud-novacompute-2-stop-interval-0s" interval="0s" name="stop" timeout="60"/>
>           <op id="overcloud-novacompute-2-monitor-interval-20" interval="20" name="monitor"/>
>         </operations>
>       </primitive>
> "
Call cib_create failed (-76): Name not unique on network
<failed>
  <failed_update object_type="primitive" operation="cib_create" reason="Name not unique on network">
    <primitive/>
  </failed_update>
</failed>
[root@overcloud-controller-0 heat-admin]# cibadmin -Ql | grep -C 0 overcloud-novacompute-2
      <node type="remote" id="overcloud-novacompute-2" uname="overcloud-novacompute-2">
        <instance_attributes id="nodes-overcloud-novacompute-2">
          <nvpair id="nodes-overcloud-novacompute-2-osprole" name="osprole" value="compute"/>
--
    <node_state remote_node="true" id="overcloud-novacompute-2" uname="overcloud-novacompute-2" crm-debug-origin="do_update_resource" node_fenced="0">
      <transient_attributes id="overcloud-novacompute-2">
        <instance_attributes id="status-overcloud-novacompute-2"/>
--
      <lrm id="overcloud-novacompute-2">

Re-assigning

Comment 3 Tomas Jelinek 2016-04-22 07:05:31 UTC
This may be related to the fact that pcs does not run crm_node -R when removing remote nodes: https://github.com/feist/pcs/issues/78

We also need to fix the id uniqueness checks in pcs: we currently search for an id in the whole CIB, including the status section, which is wrong: bz1303136
Maybe pcs should not search for an existing id in the nodes section either? Let me know, thanks.
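
For reference, the manual workaround is essentially what the fix needs to automate: after the remote node's resource has been deleted, tell pacemaker the node is permanently gone. A minimal sketch, reusing the node name from the report above:

# purge the removed remote node from pacemaker's caches and the CIB status section
crm_node -R overcloud-novacompute-2 --force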

Comment 4 Ken Gaillot 2016-04-22 15:44:30 UTC
Agreed, pcs should do "crm_node -R" when removing a node, and that should fix this issue.

As far as what pcs should be looking at for name collisions, there are three places Pacemaker Remote nodes can show up:

1. The nodes section: not reliable, because they will have an entry here only if they have ever had a permanent node attribute set.

2. The status section: mostly reliable. They will have an entry here as long as they have ever been started.

3. The resources section: mostly reliable. You can check against the ID of any ocf:pacemaker:remote primitives configured, and the value of the remote-node attribute for any resource configured (i.e. guest nodes, usually for VirtualDomain resources, but could be any resource in theory). The only time this is not reliable is the situation described in this bz, i.e. they have been removed from the configuration but an old status entry is still present.

Bottom line, you could get away with just #2 or #3, but to be completely safe, check all three.
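
As a rough illustration only (not necessarily the exact checks pcs should implement), the three places can be eyeballed from the live CIB with greps in the style used earlier in this report, here assuming the candidate name is rh72-node3:

# 1. nodes section: an entry exists only if a permanent node attribute was ever set
cibadmin -Ql -o nodes | grep 'uname="rh72-node3"'
# 2. status section: a node_state entry exists once the node has ever started
cibadmin -Ql -o status | grep 'uname="rh72-node3"'
# 3. resources section: ocf:pacemaker:remote primitive ids and remote-node meta attribute values
cibadmin -Ql -o resources | grep 'rh72-node3'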

Pacemaker is correct in rejecting the addition in this case, because the old state info would cause problems if the same ID were reused. You could argue that pacemaker should automatically clear the state info when the node is removed from the configuration, so we should evaluate that possibility at some point.

Comment 6 Tomas Jelinek 2016-07-25 16:23:01 UTC
Created attachment 1183870 [details]
proposed fix

Test 1: pacemaker remote resource
[root@rh72-node1:~]# pcs resource create rh72-node3 remote
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource delete rh72-node3
Attempting to stop: rh72-node3...Stopped
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 2: remote-node attribute, remote-node remove
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs cluster remote-node remove rh72-node3
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 3: remote-node attribute, resource update
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource update anode meta remote-node=
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 4: remote-node attribute, resource meta
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource meta anode remote-node=
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 5: remote-node attribute, resource delete
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource delete anode
Attempting to stop: anode...Stopped
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Comment 7 Ivan Devat 2016-07-28 18:01:16 UTC
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
# the standby/unstandby above forces pacemaker to create a record in the nodes section
# the fix needs to work both with and without this command

1)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64

a)
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

b)
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

2)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64

a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0


3)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64

a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

b)
[vm-rhel72-1 ~] $  pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0


4)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64

a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0


5)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $  pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64

a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

Comment 14 errata-xmlrpc 2016-11-03 20:58:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html