Bug 1436696
Summary: | pacemaker can catch systemd unit reload as not running | | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Markus Frosch <markus.frosch> |
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.3 | CC: | abeekhof, cluster-maint, mnovacek |
Target Milestone: | rc | ||
Target Release: | 7.4 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | pacemaker-1.1.16-3.el7 | Doc Type: | Bug Fix |
Doc Text: |
Cause: If a systemd-based cluster resource were in the "reloading" state when monitored by Pacemaker, Pacemaker would consider that to be a monitor failure.
Consequence: A systemd-based resource could be unnecessarily recovered by the cluster.
Fix: Pacemaker now considers "reloading" to be a successful monitor state for a systemd-based resource.
Result: Reloads of a systemd-based resource do not trigger recovery.
|
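The cause/fix above boils down to how a systemd unit's state is translated into an OCF monitor result. The sketch below is illustrative shell only, not Pacemaker's actual implementation (which is C); the function name and the exact set of states handled are our own, but the return codes (0 = success, 7 = not running) match the log output quoted later in this report.

```sh
# Illustrative only: map a systemd ActiveState to an OCF monitor return code.
# Before the fix, "reloading" fell into the failure path, so a monitor that
# ran during ExecReload reported rc 7 (not running); after the fix,
# "reloading" is treated like "active".
ocf_rc_for_state() {
    case "$1" in
        active|reloading) echo 0 ;;   # OCF_SUCCESS: unit considered running
        inactive)         echo 7 ;;   # OCF_NOT_RUNNING
        *)                echo 1 ;;   # OCF_ERR_GENERIC: failed/unknown states
    esac
}

ocf_rc_for_state reloading   # prints 0 with the fix applied
ocf_rc_for_state inactive    # prints 7
```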
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2017-08-01 17:54:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Markus Frosch
2017-03-28 13:00:47 UTC
Yes, actually the fix is already in the build planned for 7.4.

QA: To test, create a dummy systemd service on all nodes with:

```
yum install systemd-python
cat > /usr/lib/systemd/system/bz1436696.service <<EOF
[Unit]
Description=BZ#1436696 Test Unit

[Service]
Type=notify
ExecStart=/usr/bin/python -c 'import time, systemd.daemon; systemd.daemon.notify("READY=1"); time.sleep(86400)'
ExecStop=/bin/sh -c '[ -n "$MAINPID" ] && kill -s KILL $MAINPID'
ExecReload=/bin/sh -c 'sleep 10'
EOF
systemctl daemon-reload
```

Then configure a resource:

```
# pcs resource create bz1436696test systemd:bz1436696 op monitor interval=9s
```

Then, on the node running the service:

```
# systemctl reload bz1436696
```

Before the fix, this should result in a monitor error; after the fix, there should be no error.

I have verified that reloading a systemd unit at the time of a monitor does not cause the monitoring action to fail in pacemaker-1.1.16-9.

-----

1/ Create a new dummy systemd resource whose reload action (a sleep) takes longer than the resource's monitor interval [1]
2/ Create a cluster with fencing and the newly created systemd resource, whose monitor interval is shorter than the dummy service's reload [2]
3/ Start the resource

```
[root@virt-135 ~]# pcs resource
 ...
 bz1436696test (systemd:bz1436696): Started virt-135
[root@virt-135 ~]# systemctl is-active bz1436696
active
```

before the patch (pacemaker-1.1.15-11.el7)
==========================================

```
[root@virt-135 ~]# systemctl reload bz1436696
Job for bz1436696.service canceled.
[root@virt-135 ~]# date
Wed May 24 17:19:15 CEST 2017
[root@virt-135 ~]# grep bz1436696test_monitor_9000 /var/log/cluster/corosync.log
May 24 17:18:35 [8480] virt-135 cib: info: cib_perform_op: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bz1436696test']/lrm_rsc_op[@id='bz1436696test_monitor_9000']: @transition-key=5:18:0:415724a7-08e1-4dea-b809-56372ffe1866, @transition-magic=0:0;5:18:0:415724a7-08e1-4dea-b809-56372ffe1866, @call-id=93, @last-rc-change=1495639115, @exec-time=3, @queue-time=1 >
May 24 17:19:20 [8485] virt-135 crmd: info: process_lrm_event: Result of monitor operation for bz1436696test on virt-135: 7 (not running) | call=93 key=bz1436696test_monitor_9000 confirmed=false cib-update=93
May 24 17:19:20 [8485] virt-135 crmd: info: process_lrm_event: Result of monitor operation for bz1436696test on virt-135: Cancelled | call=93 key=bz1436696test_monitor_9000 confirmed=true
```

after the patch (pacemaker-1.1.16-9.el7)
========================================

```
[root@virt-135 ~]# date
Wed May 24 17:26:47 CEST 2017
[root@virt-135 ~]# systemctl reload bz1436696
[root@virt-135 ~]# echo $?
0
[root@virt-135 ~]# grep bz1436696test_monitor_9000 /var/log/cluster/corosync.log
May 24 17:25:30 [31737] virt-135 crmd: notice: te_rsc_command: Initiating monitor operation bz1436696test_monitor_0 on virt-134 | action 15
May 24 17:25:30 [31732] virt-135 cib: info: cib_perform_op: ++ <lrm_rsc_op id="bz1436696test_last_0" operation_key="bz1436696test_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.12" transition-key="15:1:7:ffa92c87-7fe1-4557-bdcb-3b3610c343e7" transition-magic="0:7;15:1:7:ffa92c87-7fe1-4557-bdcb-3b3610c343e7" on_node="virt-134" call-id="30" rc-code="7" op-stat
May 24 17:25:30 [31737] virt-135 crmd: info: match_graph_event: Action bz1436696test_monitor_0 (15) confirmed on virt-134 (rc=7)
```

(no monitor failures after systemctl reload has been run)

---

(1) systemd resource

```
[root@virt-135 ~]# yum -y install systemd-python
[root@virt-135 ~]# systemctl cat bz1436696
# /usr/lib/systemd/system/bz1436696.service
[Unit]
Description=BZ#1436696 Test Unit

[Service]
Type=notify
ExecStart=/usr/bin/python -c 'import time, systemd.daemon; systemd.daemon.notify("READY=1"); time.sleep(86400)'
ExecStop=/bin/sh -c '[ -n "" ] && kill -s KILL '
ExecReload=/bin/sh -c 'sleep 10'

# /run/systemd/system/bz1436696.service.d/50-pacemaker.conf
[Unit]
Description=Cluster Controlled bz1436696
Before=pacemaker.service

[Service]
Restart=no
```

(2) resource setup

```
[root@virt-135 ~]# pcs resource
 Clone Set: dlm-clone [dlm]
     Started: [ virt-134 virt-135 virt-136 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-134 virt-135 virt-136 ]
 bz1436696test (systemd:bz1436696): Started virt-135
[root@virt-135 ~]# pcs resource --full
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Resource: bz1436696test (class=systemd type=bz1436696)
  Operations: monitor interval=9s (bz1436696test-monitor-interval-9s)
              start interval=0s timeout=100 (bz1436696test-start-interval-0s)
              stop interval=0s timeout=100 (bz1436696test-stop-interval-0s)
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862
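For readers decoding the corosync.log excerpts above: the `transition-magic` attribute begins with the operation status and the agent's return code, separated by a colon, ahead of the transition key. The helper below is a sketch of our own (`rc_from_magic` is not a Pacemaker tool) for pulling out the return code from such a value:

```sh
# transition-magic format: <op-status>:<rc>;<transition-key>
# rc 0 is OCF_SUCCESS; rc 7 is OCF_NOT_RUNNING (as seen before the patch).
rc_from_magic() {
    local magic="$1"
    magic="${magic#*:}"           # drop the leading op-status field
    printf '%s\n' "${magic%%;*}"  # keep the rc, up to the ';'
}

rc_from_magic "0:0;5:18:0:415724a7-08e1-4dea-b809-56372ffe1866"   # prints 0
rc_from_magic "0:7;15:1:7:ffa92c87-7fe1-4557-bdcb-3b3610c343e7"   # prints 7
```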