Bug 1499796

Summary: Concurrent 'pcs resource disable ...' doesn't always disable all mentioned resources.
Product: Red Hat Enterprise Linux 7 Reporter: Ondrej Faměra <ofamera>
Component: pcsAssignee: Tomas Jelinek <tojeline>
Status: CLOSED DUPLICATE QA Contact: cluster-qe <cluster-qe>
Severity: high Docs Contact:
Priority: high    
Version: 7.4CC: cfeist, cluster-maint, idevat, omular, tojeline
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-10-12 08:56:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
corosync.log from reproducer none

Description Ondrej Faměra 2017-10-09 12:13:43 UTC
Created attachment 1336378 [details]
corosync.log from reproducer

=== Description of problem:
Running 'pcs resource disable' on 2 nodes or in 2 instances at _same_ time causes the one of the commands to be not honoured. Both commands will exit with exit code 0.

=== Version-Release number of selected component (if applicable):
pcs-0.9.158-6.el7

=== How reproducible:
Need to achieve concurrent run of 'pcs resource disable' commands either on one node or multiple nodes within 1 second (or less).

=== How to reproduce:
Several ways on how to reproduce - CRON with commands at same time or with Ansible playbook as shown below.

1. create hosts file and playbook
# hosts
[cluster]
192.168.11.183 res1=test_1
192.168.11.184 res1=test_2

# disable_resources.yml
---  
- hosts: cluster  
  remote_user: root  
  tasks:  
  - name: disable resource set 1
    command: "pcs resource disable {{ res1 }}"

2. Run the playbook and observe what will happen when 2 nodes at same time will attempt to 'disable' a different resources.
# ansible-playbook -i hosts disable_resources.yml
...
  changed: [192.168.11.183] => {"changed": true, "cmd": ["pcs", "resource", "disable", "test_1"], "delta": "0:00:00.142540", "end": "2017-10-05 15:56:35.588529", "rc": 0, "start": "2017-10-05 15:56:35.445989", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}
  changed: [192.168.11.184] => {"changed": true, "cmd": ["pcs", "resource", "disable", "test_2"], "delta": "0:00:00.159837", "end": "2017-10-05 15:56:35.606808", "rc": 0, "start": "2017-10-05 15:56:35.446971", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}
...

=== Expected result:
  Resources 'test_1' and 'test_2' will be disabled after the ansible playbook run.

=== Actual result:
  Only one resource is disabled.


=== Additional information:
  Using 'pcs resource disable ...' naming all resources to be disabled is current workaround in case of one node.
  If 2 nodes needs to execute 'pcs resource disable' at same time (via cron) we need to introduce the delay (sleep)
  on one of the nodes to avoid the situation.

Below are relevant logs from above test with comments

pcs status (before)
 test_1	(ocf::pacemaker:Dummy):	Started fastvm-rhel-7-4-183
 test_2	(ocf::pacemaker:Dummy):	Started fastvm-rhel-7-4-184

Oct 05 15:56:34 [13046] fastvm-rhel-7-4-183        cib:     info: cib_process_ping:	Reporting our current digest to fastvm-rhel-7-4-184: 8c974c6c1dc80ec1ef989043c300aeed for 0.35.74 (0x5631e8681d20 0)
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_process_request:	Forwarding cib_replace operation for section configuration to all (origin=local/cibadmin/2)
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	Diff: --- 0.35.74 2
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	Diff: +++ 0.36.0 (null)
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	+  /cib:  @epoch=36, @num_updates=0
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	++ /cib/configuration/resources/primitive[@id='test_1']:  <meta_attributes id="test_1-meta_attributes"/>
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	++                                                          <nvpair id="test_1-meta_attributes-target-role" name="target-role" value="Stopped"/>
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	++                                                        </meta_attributes>
## ^^ here we seem to add lines defining the resource 'test_1' to be in disabled state (target-role: Stopped)

Oct 05 15:56:35 [13048] fastvm-rhel-7-4-183       lrmd:     info: cancel_recurring_action:	Cancelling ocf operation test_1_monitor_10000
Oct 05 15:56:35 [13051] fastvm-rhel-7-4-183       crmd:     info: do_lrm_rsc_op:	Performing key=36:74:0:0d6ef819-c59a-430c-9999-60c999364659 op=test_1_stop_0
Oct 05 15:56:35 [13048] fastvm-rhel-7-4-183       lrmd:     info: log_execute:	executing - rsc:test_1 action:stop call_id:343
## ^^ here we can see that cluster is actually stopping the resource to reach the disabled state

Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	Diff: --- 0.36.0 2
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	Diff: +++ 0.37.0 (null)
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	-- /cib/configuration/resources/primitive[@id='test_1']/meta_attributes[@id='test_1-meta_attributes']
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	+  /cib:  @epoch=37
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	++ /cib/configuration/resources/primitive[@id='test_2']:  <meta_attributes id="test_2-meta_attributes"/>
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	++                                                          <nvpair id="test_2-meta_attributes-target-role" name="target-role" value="Stopped"/>
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	++                                                        </meta_attributes>
## ^^ here we see a second request where the 'test_2' resource should be put into disabled state (target-role: Stopped). Here we also see that we will be replacing line in cib XML with "primitive[@id='test_1']/meta_attributes[@id='test_1-meta_attributes']".
However in that part of CIB XML we have also the definition that 'test_1' should be disabled.

Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	Diff: --- 0.37.0 2
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	Diff: +++ 0.37.1 (null)
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	+  /cib:  @num_updates=1
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:	+  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='test_1']/lrm_rsc_op[@id='test_1_last_0']:  @operation_key=test_1_stop_0, @operation=stop, @transition-key=36:74:0:0d6ef819-c59a-430c-9999-60c999364659, @transition-magic=0:0;36:74:0:0d6ef819-c59a-430c-9999-60c999364659, @call-id=343, @last-run=1507211795, @last-rc-change=1507211795, @exec-time=17
Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183        cib:     info: cib_process_request:	Completed cib_modify operation for section status: OK (rc=0, origin=fastvm-rhel-7-4-183/crmd/221, version=0.37.1)
Oct 05 15:56:35 [13051] fastvm-rhel-7-4-183       crmd:     info: do_lrm_rsc_op:	Performing key=36:75:0:0d6ef819-c59a-430c-9999-60c999364659 op=test_1_start_0
Oct 05 15:56:35 [13048] fastvm-rhel-7-4-183       lrmd:     info: log_execute:	executing - rsc:test_1 action:start call_id:346
## ^^ as the test_1 is not in disabled state, the cluster will attempt to start it.

pcs status (after)
 test_1	(ocf::pacemaker:Dummy):	Started fastvm-rhel-7-4-183
 test_2	(ocf::pacemaker:Dummy):	Stopped (disabled)

Comment 3 Tomas Jelinek 2017-10-09 12:51:49 UTC
I believe this is already fixed by bz1441673.

The fix only applies to pcs commands overhauled to the new architecture. However, "pcs resource disable" and "pcs resource enable" have been overhauled in https://github.com/ClusterLabs/pcs/commit/0c4cba8bdb69c8984cd84a0769c35dc52cdc5be6 which has been included in pcs 0.9.157 and never.

Note that using commands which are not yet overhauled may still trigger the described race condition.

Possible workaround is to edit the CIB in a file and push a diff of changes to the cluster. This feature has been implemented and described in bz1404233.

Comment 7 Ondrej Faměra 2017-10-12 07:49:01 UTC
Tested with test package from bz1441673 on my setup and it now behaves correctly.
Should we mark this as duplicate of bz1441673?

I have tested both 'pcs resource enable' and 'pcs resource disable'.

Also from logs it looks more the way I would expect it ('pcs resource disable' log below)

Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        Diff: --- 0.41.7 2
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        Diff: +++ 0.42.0 (null)
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        +  /cib:  @epoch=42, @num_updates=0
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        ++ /cib/configuration/resources/primitive[@id='test_2']:  <meta_attributes id="test_2-meta_attributes"/>
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        ++                                                          <nvpair id="test_2-meta_attributes-target-role" name="target-role" value="Stopped"/>
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        ++                                                        </meta_attributes>
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_process_request:   Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=fastvm-rhel-7-4-184/cibadmin/2, version=0.42.0)

Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        Diff: --- 0.42.0 2
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        Diff: +++ 0.43.0 (null)
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        +  /cib:  @epoch=43
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        ++ /cib/configuration/resources/primitive[@id='test_1']:  <meta_attributes id="test_1-meta_attributes"/>
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        ++                                                          <nvpair id="test_1-meta_attributes-target-role" name="target-role" value="Stopped"/>
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_perform_op:        ++                                                        </meta_attributes>
Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183        cib:     info: cib_process_request:   Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=fastvm-rhel-7-4-183/cibadmin/2, version=0.43.0)

Comment 8 Tomas Jelinek 2017-10-12 08:56:23 UTC
Closing as proposed in comment 7.

*** This bug has been marked as a duplicate of bug 1441673 ***