Bug 1303136

Summary: Cannot create a new resource with the same name as one that previously failed and was deleted, until a cleanup is run
Product: Red Hat Enterprise Linux 7
Reporter: Raoul Scarazzini <rscarazz>
Component: pcs
Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: low
Version: 7.2
CC: abeekhof, cfeist, cluster-maint, fdinitto, idevat, michele, omular, rmarigny, rsteiger, skinjo, tojeline
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: pcs-0.9.152-5.el7
Doc Type: Bug Fix
Doc Text:
Cause: User deletes a resource from a cluster.
Consequence: Sometimes (depending on the cluster status and configuration) traces of the resource remain in the cluster, and pcs then refuses to create a resource with the same name.
Fix: Properly check whether the specified resource id really exists in the cluster.
Result: It is possible to recreate the resource.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 20:56:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1329472
Attachments:
proposed fix (flags: none)

Description Raoul Scarazzini 2016-01-29 16:17:59 UTC
Description of problem:

If you remove a resource which has failed actions, you cannot create a new resource with the same name until you run a cleanup.

Version-Release number of selected component (if applicable):

pacemaker-1.1.13-10.el7.x86_64
pcs-0.9.143-15.el7.x86_64

How reproducible:

Always, at least from my tests.

Steps to Reproduce:
1. Delete a resource which has got failed actions:

# sudo pcs resource delete nova-compute-checkevacuate
Removing Constraint - location-nova-compute-checkevacuate-clone
Removing Constraint - order-openstack-nova-conductor-clone-nova-compute-checkevacuate-clone-mandatory
Removing Constraint - order-nova-compute-checkevacuate-clone-nova-compute-clone-mandatory
Deleting Resource - nova-compute-checkevacuate

Removal is successful.

2. Try to create a resource with the same name:

# source ./overcloudrc; sudo pcs resource create nova-compute-checkevacuate ocf:openstack:nova-compute-wait auth_url=$OS_AUTH_URL username=$OS_USERNAME password=$OS_PASSWORD tenant_name=$OS_TENANT_NAME domain=localdomain no_shared_storage=1 op start timeout=300 --clone interleave=true --disabled --force
Error: unable to create resource/fence device 'nova-compute-checkevacuate', 'nova-compute-checkevacuate' already exists on this system

3. Try to delete the (nonexistent) resource:

# sudo pcs resource delete nova-compute-checkevacuate
Error: Resource 'nova-compute-checkevacuate' does not exist.

Actual results:

Error: unable to create resource/fence device 'nova-compute-checkevacuate', 'nova-compute-checkevacuate' already exists on this system

Expected results:

Resource is successfully created.

Additional info:

As a workaround, the problem goes away if you clean everything up.
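
For reference, a cleanup of just the affected resource should be enough (using the resource id from the reproducer above):

# sudo pcs resource cleanup nova-compute-checkevacuate

Running "pcs resource cleanup" without a resource id cleans up all resources.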

Comment 2 Tomas Jelinek 2016-02-19 15:54:04 UTC
See also this github issue: https://github.com/feist/pcs/issues/78

Comment 3 Zhaoming Zhang 2016-06-27 10:42:22 UTC
It seems I have hit the same problem in my cluster.
Until a fixed version is released, shall I use "crm_resource -C" to avoid the problem?


[root@nas-210 ~]# /usr/sbin/cibadmin -l -Q |grep nas_samba
          <lrm_resource id="nas_samba" type="smb" class="systemd">
            <lrm_rsc_op id="nas_samba_last_0" operation_key="nas_samba_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="72:220:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:7;72:220:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node1" call-id="121" rc-code="7" op-status="0" interval="0" last-run="1466676416" last-rc-change="1466676416" exec-time="119" queue-time="1" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
[root@nas-210 ~]# pcs resource create nas_samba systemd:smb op monitor start-delay=10s interval=15s timeout=20s --group nas_group
Error: unable to create resource/fence device 'nas_samba', 'nas_samba' already exists on this system
[root@nas-210 ~]# crm_node -R nas_samba
The supplied command is considered dangerous.  To prevent accidental destruction of the cluster, the --force flag is required in order to proceed.
[root@nas-210 ~]# crm_node --force -R nas_samba
[root@nas-210 ~]#
[root@nas-210 ~]# /usr/sbin/cibadmin -l -Q |grep nas_samba
          <lrm_resource id="nas_samba" type="smb" class="systemd">
            <lrm_rsc_op id="nas_samba_last_0" operation_key="nas_samba_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="72:220:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:7;72:220:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node1" call-id="121" rc-code="7" op-status="0" interval="0" last-run="1466676416" last-rc-change="1466676416" exec-time="119" queue-time="1" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
[root@nas-210 ~]# pcs resource create nas_samba systemd:smb op monitor start-delay=10s interval=15s timeout=20s --group nas_group
Error: unable to create resource/fence device 'nas_samba', 'nas_samba' already exists on this system
[root@nas-210 ~]# crm_node --force -R nas_samba
[root@nas-210 ~]#
[root@nas-210 ~]# /usr/sbin/cibadmin -l -Q |grep nas_samba
          <lrm_resource id="nas_samba" type="smb" class="systemd">
            <lrm_rsc_op id="nas_samba_last_0" operation_key="nas_samba_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="72:220:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:7;72:220:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node1" call-id="121" rc-code="7" op-status="0" interval="0" last-run="1466676416" last-rc-change="1466676416" exec-time="119" queue-time="1" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
[root@nas-210 ~]# pcs config | grep nas_samba
[root@nas-210 ~]# 
[root@nas-210 ~]# pcs resource delete nas_samba
Error: Resource 'nas_samba' does not exist.
[root@nas-210 ~]# crm_resource -C nas_samba
Waiting for 1 replies from the CRMd. OK
[root@nas-210 ~]# /usr/sbin/cibadmin -l -Q |grep nas_samba
[root@nas-210 ~]#
[root@nas-210 ~]# pcs resource create nas_samba systemd:smb op monitor start-delay=10s interval=15s timeout=20s --group nas_group
[root@nas-210 ~]#
[root@nas-210 ~]# pcs config | grep nas_samba
  Resource: nas_samba (class=systemd type=smb)
   Operations: monitor interval=15s start-delay=10s timeout=20s (nas_samba-monitor-interval-15s)

Comment 4 Tomas Jelinek 2016-06-27 11:42:34 UTC
(In reply to Zhaoming Zhang from comment #3)
> It seems I have hit the same problem in my cluster.
> Until a fixed version is released, shall I use "crm_resource -C" to avoid
> the problem?

Yes, that should do the trick. Alternatively, you can use the "pcs resource cleanup" command, which runs "crm_resource -C" for you.

"crm_node --force -R" command is for a case when a node (not a resource) has been removed and cannot be added back because there are still traces of it in pacemaker.

Comment 5 Zhaoming Zhang 2016-06-29 08:17:06 UTC
(In reply to Tomas Jelinek from comment #4)
> (In reply to Zhaoming Zhang from comment #3)
> > It seems I meet the same problem in my cluster.
> > Before the new version established,shall I use “crm_resource -C ” to avoid
> > the problem?
> 
> Yes, that should do the trick. Alternatively you can use "pcs resource
> cleanup" command which runs "crm_resource -C" for you.
> 
> "crm_node --force -R" command is for a case when a node (not a resource) has
> been removed and cannot be added back because there are still traces of it
> in pacemaker.


Thanks a lot!

I hit the problem in a case like this:
1. In a two-node cluster, e.g. node0 and node1, I ran "pcs cluster standby node1" and then ran "poweroff" on node1.
2. Then I deleted a resource and tried to add a resource with the same name back. I hit the problem again.
3. I used "/usr/sbin/cibadmin -l -Q" to check for traces and found them. I then ran "crm_resource -C" to do the trick, but it did not work. I tried "crm_resource -C" several times; it still did not work.
4. Then I powered node1 back on. After node1 rejoined the cluster, the traces disappeared on their own, without any further command.

Could you please tell me why "crm_resource -C" did not work in step 3 and why the traces disappeared in step 4?


Any info would be helpful, thanks!

[root@nas-220 ~]# pcs resource delete nas_nfs
Error: Resource 'nas_nfs' does not exist.
[root@nas-220 ~]# /usr/sbin/cibadmin -l -Q| grep nas_nfs
          <lrm_resource id="nas_nfs" type="nfsserver" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="nas_nfs_last_failure_0" operation_key="nas_nfs_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="24:745:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:0;24:745:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node0" call-id="1210" rc-code="0" op-status="0" interval="0" last-run="1467102672" last-rc-change="1467102672" exec-time="370" queue-time="0" op-digest="8236642d60a6a43b6357038bd2cf15c7"/>
            <lrm_rsc_op id="nas_nfs_last_0" operation_key="nas_nfs_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="128:832:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:0;128:832:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node0" call-id="1315" rc-code="0" op-status="0" interval="0" last-run="1467167131" last-rc-change="1467167131" exec-time="430" queue-time="0" op-digest="8236642d60a6a43b6357038bd2cf15c7"/>
            <lrm_rsc_op id="nas_nfs_monitor_15000" operation_key="nas_nfs_monitor_15000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="96:746:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:0;96:746:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node0" call-id="1227" rc-code="0" op-status="0" interval="15000" last-rc-change="1467102682" exec-time="386" queue-time="10000" op-digest="cf9065dcbe3d8e10c2e27af5e9996ae4"/>
[root@nas-220 ~]#  crm_resource -C nas_nfs
Waiting for 1 replies from the CRMd. OK
[root@nas-220 ~]# /usr/sbin/cibadmin -l -Q| grep nas_nfs
          <lrm_resource id="nas_nfs" type="nfsserver" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="nas_nfs_last_failure_0" operation_key="nas_nfs_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="24:745:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:0;24:745:7:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node0" call-id="1210" rc-code="0" op-status="0" interval="0" last-run="1467102672" last-rc-change="1467102672" exec-time="370" queue-time="0" op-digest="8236642d60a6a43b6357038bd2cf15c7"/>
            <lrm_rsc_op id="nas_nfs_last_0" operation_key="nas_nfs_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="128:832:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:0;128:832:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node0" call-id="1315" rc-code="0" op-status="0" interval="0" last-run="1467167131" last-rc-change="1467167131" exec-time="430" queue-time="0" op-digest="8236642d60a6a43b6357038bd2cf15c7"/>
            <lrm_rsc_op id="nas_nfs_monitor_15000" operation_key="nas_nfs_monitor_15000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="96:746:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" transition-magic="0:0;96:746:0:94eec9d7-eafd-457c-93ec-cfe7a5e45232" on_node="node0" call-id="1227" rc-code="0" op-status="0" interval="15000" last-rc-change="1467102682" exec-time="386" queue-time="10000" op-digest="cf9065dcbe3d8e10c2e27af5e9996ae4"/>
[root@nas-220 ~]# /usr/sbin/cibadmin -l -Q| grep nas_nfs
[root@nas-220 ~]#

Comment 6 Tomas Jelinek 2016-07-25 16:15:02 UTC
Created attachment 1183868 [details]
proposed fix

Setup:
[root@rh72-node1:~]# pcs resource create d1 dummy
[root@rh72-node1:~]# crm_resource -F -r d1 -H rh72-node1
Waiting for 1 replies from the CRMd. OK
[root@rh72-node1:~]# crm_resource -F -r d1 -H rh72-node2
Waiting for 1 replies from the CRMd. OK
[root@rh72-node1:~]# pcs cluster standby rh72-node2
[root@rh72-node1:~]# pcs resource delete d1
Attempting to stop: d1...Stopped

Before fix:
[root@rh72-node1:~]# pcs resource create d1 dummy
Error: unable to create resource/fence device 'd1', 'd1' already exists on this system

After fix:
[root@rh72-node1:~]# pcs resource create d1 dummy
[root@rh72-node1:~]# pcs resource show d1
 Resource: d1 (class=ocf provider=heartbeat type=Dummy)
  Operations: start interval=0s timeout=20 (d1-start-interval-0s)
              stop interval=0s timeout=20 (d1-stop-interval-0s)
              monitor interval=10 timeout=20 (d1-monitor-interval-10)

Comment 7 Tomas Jelinek 2016-07-25 16:22:17 UTC
(In reply to Zhaoming Zhang from comment #5)
> 
> I hit the problem in a case like this:
> 1. In a two-node cluster, e.g. node0 and node1, I ran "pcs cluster standby
> node1" and then ran "poweroff" on node1.
> 2. Then I deleted a resource and tried to add a resource with the same name
> back. I hit the problem again.
> 3. I used "/usr/sbin/cibadmin -l -Q" to check for traces and found them. I
> then ran "crm_resource -C" to do the trick, but it did not work. I tried
> "crm_resource -C" several times; it still did not work.
> 4. Then I powered node1 back on. After node1 rejoined the cluster, the
> traces disappeared on their own, without any further command.
>
> Could you please tell me why "crm_resource -C" did not work in step 3 and
> why the traces disappeared in step 4?
>
> 

Thank you for this additional report, it was very helpful.

Apparently pacemaker does not update the status of offline and standby nodes. When you brought the node back online, its status got updated, and that is why the resource traces disappeared automatically.

With the patch from comment 6, pcs no longer cares about these traces and allows you to recreate the resource.
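
In other words, the leftover traces live only in the status section of the CIB while the configuration section is already clean, so the id check only needs to look at the configuration. A rough illustration with cibadmin (using the resource id "d1" from the reproducer in comment 6):

  cibadmin -Q | grep 'id="d1"'                (matches, because of the status-section traces)
  cibadmin -Q -o resources | grep 'id="d1"'   (no match, the configuration is clean)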

Comment 8 Ivan Devat 2016-07-28 18:45:16 UTC
Setup:
[vm-rhel72-1 ~] $ pcs resource create d1 dummy
[vm-rhel72-1 ~] $ crm_resource -F -r d1 -H vm-rhel72-1
Waiting for 1 replies from the CRMd. OK
[vm-rhel72-1 ~] $ crm_resource -F -r d1 -H vm-rhel72-3
Waiting for 1 replies from the CRMd. OK
[vm-rhel72-1 ~] $ pcs cluster standby vm-rhel72-3
[vm-rhel72-1 ~] $ pcs resource delete d1
Attempting to stop: d1...Stopped


Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create d1 dummy
Error: unable to create resource/fence device 'd1', 'd1' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create d1 dummy
[vm-rhel72-1 ~] $ pcs resource
 d1     (ocf::heartbeat:Dummy): Started vm-rhel72-1

Comment 12 errata-xmlrpc 2016-11-03 20:56:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html