Bug 1436696
| Summary: | pacemaker can catch systemd unit reload as not running | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Markus Frosch <markus.frosch> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.3 | CC: | abeekhof, cluster-maint, mnovacek |
| Target Milestone: | rc | ||
| Target Release: | 7.4 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | pacemaker-1.1.16-3.el7 | Doc Type: | Bug Fix |
| Doc Text: | Cause: If a systemd-based cluster resource was in the "reloading" state when monitored by Pacemaker, Pacemaker considered that a monitor failure. Consequence: A systemd-based resource could be unnecessarily recovered by the cluster. Fix: Pacemaker now considers "reloading" a successful monitor state for a systemd-based resource. Result: Reloads of a systemd-based resource do not trigger recovery. | | |
| Story Points: | --- | | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-08-01 17:54:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: ||||
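The behavior change described in Doc Text can be sketched as a small state-mapping function. This is a hypothetical illustration, not Pacemaker source code: it maps a systemd unit's ActiveState to an OCF-style monitor return code, and the fix amounts to adding "reloading" to the set of states treated as "running".

```shell
# Hypothetical sketch (NOT Pacemaker source): map a systemd ActiveState
# to an OCF-style monitor result. The fix described in Doc Text treats
# "reloading" the same as "active".
state_to_ocf() {
    case "$1" in
        active|reloading)   echo 0 ;;  # OCF_SUCCESS: resource is running
        inactive|failed|deactivating)
                            echo 7 ;;  # OCF_NOT_RUNNING: pre-fix, "reloading"
                                       # effectively fell here (rc=7 in the logs below)
        *)                  echo 1 ;;  # OCF_ERR_GENERIC: anything unexpected
    esac
}

state_to_ocf reloading   # prints 0 with the fixed mapping
```

The state names and return codes here follow OCF conventions; the exact internal mapping in Pacemaker is more involved, but the reload case reduces to this one-line change in classification.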
Description
Markus Frosch
2017-03-28 13:00:47 UTC
Yes, actually the fix is already in the build planned for 7.4.

QA: To test, create a dummy systemd service on all nodes with:

    yum install systemd-python
    cat > /usr/lib/systemd/system/bz1436696.service <<EOF
    [Unit]
    Description=BZ#1436696 Test Unit
    [Service]
    Type=notify
    ExecStart=/usr/bin/python -c 'import time, systemd.daemon; systemd.daemon.notify("READY=1"); time.sleep(86400)'
    ExecStop=/bin/sh -c '[ -n "$MAINPID" ] && kill -s KILL $MAINPID'
    ExecReload=/bin/sh -c 'sleep 10'
    EOF
    systemctl daemon-reload

Then configure a resource:

    # pcs resource create bz1436696test systemd:bz1436696 op monitor interval=9s

Then, on the node running the service:

    # systemctl reload bz1436696

Before the fix, this should result in a monitor error; after the fix, there should be no error.
I have verified that reloading a systemd unit at the time of a monitor does not
cause the monitor action to fail in pacemaker-1.1.16-9.
-----
1/ Create a new dummy systemd service whose reload action (a sleep) takes
longer than the resource's monitor interval [1]
2/ Create a cluster with fencing and the newly created systemd resource, with
a monitor interval shorter than the dummy service's reload duration [2]
3/ Start the resource
[root@virt-135 ~]# pcs resource
...
bz1436696test (systemd:bz1436696): Started virt-135
[root@virt-135 ~]# systemctl is-active bz1436696
active
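The timing relationship in steps 1/ and 2/ is what makes the test deterministic. A trivial sketch of the invariant, with the numbers taken from the unit file and resource configuration shown below:

```shell
# Why the test reliably reproduces the bug: the dummy unit's ExecReload
# sleeps for 10s while Pacemaker monitors the resource every 9s, so at
# least one monitor operation is guaranteed to run while the unit's
# ActiveState is "reloading".
reload_secs=10       # from ExecReload=/bin/sh -c 'sleep 10'
monitor_interval=9   # from op monitor interval=9s

[ "$reload_secs" -gt "$monitor_interval" ] \
    && echo "a monitor will fire inside the reload window"
```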
before the patch (pacemaker-1.1.15-11.el7)
==========================================
[root@virt-135 ~]# systemctl reload bz1436696
Job for bz1436696.service canceled.
[root@virt-135 ~]# date
Wed May 24 17:19:15 CEST 2017
[root@virt-135 ~]# grep bz1436696test_monitor_9000 /var/log/cluster/corosync.log
May 24 17:18:35 [8480] virt-135 cib: info: cib_perform_op: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bz1436696test']/lrm_rsc_op[@id='bz1436696test_monitor_9000']: @transition-key=5:18:0:415724a7-08e1-4dea-b809-56372ffe1866, @transition-magic=0:0;5:18:0:415724a7-08e1-4dea-b809-56372ffe1866, @call-id=93, @last-rc-change=1495639115, @exec-time=3, @queue-time=1
> May 24 17:19:20 [8485] virt-135 crmd: info: process_lrm_event: Result of monitor operation for bz1436696test on virt-135: 7 (not running) | call=93 key=bz1436696test_monitor_9000 confirmed=false cib-update=93
May 24 17:19:20 [8485] virt-135 crmd: info: process_lrm_event: Result of monitor operation for bz1436696test on virt-135: Cancelled | call=93 key=bz1436696test_monitor_9000 confirmed=true
after the patch (pacemaker-1.1.16-9.el7)
========================================
[root@virt-135 ~]# date
Wed May 24 17:26:47 CEST 2017
[root@virt-135 ~]# systemctl reload bz1436696
[root@virt-135 ~]# echo $?
0
[root@virt-135 ~]# grep bz1436696test_monitor_9000 /var/log/cluster/corosync.log
May 24 17:25:30 [31737] virt-135 crmd: notice: te_rsc_command: Initiating monitor operation bz1436696test_monitor_0 on virt-134 | action 15
May 24 17:25:30 [31732] virt-135 cib: info: cib_perform_op: ++ <lrm_rsc_op id="bz1436696test_last_0" operation_key="bz1436696test_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.12" transition-key="15:1:7:ffa92c87-7fe1-4557-bdcb-3b3610c343e7" transition-magic="0:7;15:1:7:ffa92c87-7fe1-4557-bdcb-3b3610c343e7" on_node="virt-134" call-id="30" rc-code="7" op-stat
May 24 17:25:30 [31737] virt-135 crmd: info: match_graph_event: Action bz1436696test_monitor_0 (15) confirmed on virt-134 (rc=7)
(no monitor failures after systemctl reload was run)
---
(1) systemd service
[root@virt-135 ~]# yum -y install systemd-python
[root@virt-135 ~]# systemctl cat bz1436696
# /usr/lib/systemd/system/bz1436696.service
[Unit]
Description=BZ#1436696 Test Unit
[Service]
Type=notify
ExecStart=/usr/bin/python -c 'import time, systemd.daemon; systemd.daemon.notify("READY=1"); time.sleep(86400)'
ExecStop=/bin/sh -c '[ -n "$MAINPID" ] && kill -s KILL $MAINPID'
ExecReload=/bin/sh -c 'sleep 10'
# /run/systemd/system/bz1436696.service.d/50-pacemaker.conf
[Unit]
Description=Cluster Controlled bz1436696
Before=pacemaker.service
[Service]
Restart=no
(2) resource setup
[root@virt-135 ~]# pcs resource
Clone Set: dlm-clone [dlm]
Started: [ virt-134 virt-135 virt-136 ]
Clone Set: clvmd-clone [clvmd]
Started: [ virt-134 virt-135 virt-136 ]
bz1436696test (systemd:bz1436696): Started virt-135
[root@virt-135 ~]# pcs resource --full
Clone: dlm-clone
Meta Attrs: interleave=true ordered=true
Resource: dlm (class=ocf provider=pacemaker type=controld)
Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
start interval=0s timeout=90 (dlm-start-interval-0s)
stop interval=0s timeout=100 (dlm-stop-interval-0s)
Clone: clvmd-clone
Meta Attrs: interleave=true ordered=true
Resource: clvmd (class=ocf provider=heartbeat type=clvm)
Attributes: with_cmirrord=1
Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
start interval=0s timeout=90 (clvmd-start-interval-0s)
stop interval=0s timeout=90 (clvmd-stop-interval-0s)
Resource: bz1436696test (class=systemd type=bz1436696)
Operations: monitor interval=9s (bz1436696test-monitor-interval-9s)
start interval=0s timeout=100 (bz1436696test-start-interval-0s)
stop interval=0s timeout=100 (bz1436696test-stop-interval-0s)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862