Bug 1499796
Summary: | Concurrent 'pcs resource disable ...' doesn't always disable all mentioned resources. | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Ondrej Faměra <ofamera> | ||||
Component: | pcs | Assignee: | Tomas Jelinek <tojeline> | ||||
Status: | CLOSED DUPLICATE | QA Contact: | cluster-qe <cluster-qe> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.4 | CC: | cfeist, cluster-maint, idevat, omular, tojeline | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2017-10-12 08:56:23 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
I believe this is already fixed by bz1441673. The fix only applies to pcs commands overhauled to the new architecture. However, "pcs resource disable" and "pcs resource enable" have been overhauled in https://github.com/ClusterLabs/pcs/commit/0c4cba8bdb69c8984cd84a0769c35dc52cdc5be6 which has been included in pcs 0.9.157 and never. Note that using commands which are not yet overhauled may still trigger the described race condition. Possible workaround is to edit the CIB in a file and push a diff of changes to the cluster. This feature has been implemented and described in bz1404233. Tested with test package from bz1441673 on my setup and it now behaves correctly. Should we mark this as duplicate of bz1441673? I have tested both 'pcs resource enable' and 'pcs resource disable'. Also from logs it looks more the way I would expect it ('pcs resource disable' log below) Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: --- 0.41.7 2 Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: +++ 0.42.0 (null) Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: + /cib: @epoch=42, @num_updates=0 Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='test_2']: <meta_attributes id="test_2-meta_attributes"/> Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ <nvpair id="test_2-meta_attributes-target-role" name="target-role" value="Stopped"/> Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ </meta_attributes> Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=fastvm-rhel-7-4-184/cibadmin/2, version=0.42.0) Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: --- 0.42.0 2 Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: +++ 0.43.0 (null) Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: + /cib: @epoch=43 Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='test_1']: <meta_attributes id="test_1-meta_attributes"/> Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ <nvpair id="test_1-meta_attributes-target-role" name="target-role" value="Stopped"/> Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ </meta_attributes> Oct 12 09:29:03 [1228] fastvm-rhel-7-4-183 cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=fastvm-rhel-7-4-183/cibadmin/2, version=0.43.0) Closing as proposed in comment 7. *** This bug has been marked as a duplicate of bug 1441673 *** |
Created attachment 1336378 [details] corosync.log from reproducer === Description of problem: Running 'pcs resource disable' on 2 nodes or in 2 instances at _same_ time causes the one of the commands to be not honoured. Both commands will exit with exit code 0. === Version-Release number of selected component (if applicable): pcs-0.9.158-6.el7 === How reproducible: Need to achieve concurrent run of 'pcs resource disable' commands either on one node or multiple nodes within 1 second (or less). === How to reproduce: Several ways on how to reproduce - CRON with commands at same time or with Ansible playbook as shown below. 1. create hosts file and playbook # hosts [cluster] 192.168.11.183 res1=test_1 192.168.11.184 res1=test_2 # disable_resources.yml --- - hosts: cluster remote_user: root tasks: - name: disable resource set 1 command: "pcs resource disable {{ res1 }}" 2. Run the playbook and observe what will happen when 2 nodes at same time will attempt to 'disable' a different resources. # ansible-playbook -i hosts disable_resources.yml ... changed: [192.168.11.183] => {"changed": true, "cmd": ["pcs", "resource", "disable", "test_1"], "delta": "0:00:00.142540", "end": "2017-10-05 15:56:35.588529", "rc": 0, "start": "2017-10-05 15:56:35.445989", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []} changed: [192.168.11.184] => {"changed": true, "cmd": ["pcs", "resource", "disable", "test_2"], "delta": "0:00:00.159837", "end": "2017-10-05 15:56:35.606808", "rc": 0, "start": "2017-10-05 15:56:35.446971", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []} ... === Expected result: Resources 'test_1' and 'test_2' will be disabled after the ansible playbook run. === Actual result: Only one resource is disabled. === Additional information: Using 'pcs resource disable ...' naming all resources to be disabled is current workaround in case of one node. If 2 nodes needs to execute 'pcs resource disable' at same time (via cron) we need to introduce the delay (sleep) on one of the nodes to avoid the situation. Below are relevant logs from above test with comments pcs status (before) test_1 (ocf::pacemaker:Dummy): Started fastvm-rhel-7-4-183 test_2 (ocf::pacemaker:Dummy): Started fastvm-rhel-7-4-184 Oct 05 15:56:34 [13046] fastvm-rhel-7-4-183 cib: info: cib_process_ping: Reporting our current digest to fastvm-rhel-7-4-184: 8c974c6c1dc80ec1ef989043c300aeed for 0.35.74 (0x5631e8681d20 0) Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_process_request: Forwarding cib_replace operation for section configuration to all (origin=local/cibadmin/2) Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: --- 0.35.74 2 Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: +++ 0.36.0 (null) Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: + /cib: @epoch=36, @num_updates=0 Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='test_1']: <meta_attributes id="test_1-meta_attributes"/> Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ <nvpair id="test_1-meta_attributes-target-role" name="target-role" value="Stopped"/> Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ </meta_attributes> ## ^^ here we seem to add lines defining the resource 'test_1' to be in disabled state (target-role: Stopped) Oct 05 15:56:35 [13048] fastvm-rhel-7-4-183 lrmd: info: cancel_recurring_action: Cancelling ocf operation test_1_monitor_10000 Oct 05 15:56:35 [13051] fastvm-rhel-7-4-183 crmd: info: do_lrm_rsc_op: Performing key=36:74:0:0d6ef819-c59a-430c-9999-60c999364659 op=test_1_stop_0 Oct 05 15:56:35 [13048] fastvm-rhel-7-4-183 lrmd: info: log_execute: executing - rsc:test_1 action:stop call_id:343 ## ^^ here we can see that cluster is actually stopping the resource to reach the disabled state Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: --- 0.36.0 2 Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: +++ 0.37.0 (null) Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: -- /cib/configuration/resources/primitive[@id='test_1']/meta_attributes[@id='test_1-meta_attributes'] Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: + /cib: @epoch=37 Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='test_2']: <meta_attributes id="test_2-meta_attributes"/> Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ <nvpair id="test_2-meta_attributes-target-role" name="target-role" value="Stopped"/> Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: ++ </meta_attributes> ## ^^ here we see a second request where the 'test_2' resource should be put into disabled state (target-role: Stopped). Here we also see that we will be replacing line in cib XML with "primitive[@id='test_1']/meta_attributes[@id='test_1-meta_attributes']". However in that part of CIB XML we have also the definition that 'test_1' should be disabled. Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: --- 0.37.0 2 Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: Diff: +++ 0.37.1 (null) Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: + /cib: @num_updates=1 Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='test_1']/lrm_rsc_op[@id='test_1_last_0']: @operation_key=test_1_stop_0, @operation=stop, @transition-key=36:74:0:0d6ef819-c59a-430c-9999-60c999364659, @transition-magic=0:0;36:74:0:0d6ef819-c59a-430c-9999-60c999364659, @call-id=343, @last-run=1507211795, @last-rc-change=1507211795, @exec-time=17 Oct 05 15:56:35 [13046] fastvm-rhel-7-4-183 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=fastvm-rhel-7-4-183/crmd/221, version=0.37.1) Oct 05 15:56:35 [13051] fastvm-rhel-7-4-183 crmd: info: do_lrm_rsc_op: Performing key=36:75:0:0d6ef819-c59a-430c-9999-60c999364659 op=test_1_start_0 Oct 05 15:56:35 [13048] fastvm-rhel-7-4-183 lrmd: info: log_execute: executing - rsc:test_1 action:start call_id:346 ## ^^ as the test_1 is not in disabled state, the cluster will attempt to start it. pcs status (after) test_1 (ocf::pacemaker:Dummy): Started fastvm-rhel-7-4-183 test_2 (ocf::pacemaker:Dummy): Stopped (disabled)