+++ This bug was initially created as a clone of Bug #1360234 +++

Description of problem:
After putting the cluster into maintenance mode, rebooting the nodes, and then clearing the maintenance status, the ClusterMon resource is sometimes reported as running even though "crm_mon" is NOT running. ClusterMon should detect a missing "crm_mon" reliably so that the node can be reported as "bad".

Mechanism used by the "ClusterMon" script shipped as "/usr/lib/ocf/resource.d/pacemaker/ClusterMon": the script reads the PID from the PID file created by "crm_mon" and sends a signal ("kill -s 0 <PID>") to that PID. If any process with this PID exists, the resource agent reports that "crm_mon" is running. But after a reboot of the machine this can be a completely different process (e.g. another daemon), in which case the value reported by the agent is wrong.

Actual implementation:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1; rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

So "kill -s 0 $pid" sometimes returns success even though "crm_mon" is not running.

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.4.x86_64

How reproducible:
Depends on the processes actually running on the node and which PIDs they happen to get.

Steps to Reproduce:
1. Configure a ClusterMon resource.
2. Put the node into maintenance mode.
3. Reboot the node.
4. Either find a way to force the PID originally assigned to crm_mon onto another process, or repeat the test a few times until it happens by chance.
5. Disable maintenance mode.

Actual results:
If the PID of crm_mon changes after the reboot, ClusterMon_monitor still checks the old PID, and the agent can kill an unrelated process instead.

Expected results:
ClusterMon_monitor checks the PID of the crm_mon that is actually running, not the stale one.

Additional info:
Customer proposed the following patch:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1 \
                && ps -fp $pid | grep $OCF_RESKEY_pidfile >/dev/null 2>&1
            rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

This filters out the case where the process behind the reported PID is not the crm_mon started for this resource (though it might still kill an unrelated process).

--- Additional comment from Oyvind Albrigtsen on 2016-09-23 12:11:09 CEST ---

Tested and working patch: https://github.com/ClusterLabs/pacemaker/pull/1147
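For illustration only (this is neither the customer patch above nor the upstream pull request), here is a minimal sketch of a monitor check that also verifies which command owns the PID. The function name ClusterMon_monitor_sketch is made up, and it assumes $OCF_RESKEY_pidfile and the OCF exit codes are already set up exactly as in the agent above:

# Illustrative sketch only: besides checking that the PID exists, confirm
# that it actually belongs to a crm_mon process before reporting success.
# Assumes $OCF_RESKEY_pidfile and the OCF exit codes are defined, as above.
ClusterMon_monitor_sketch() {
    if [ -f "$OCF_RESKEY_pidfile" ]; then
        pid=`cat "$OCF_RESKEY_pidfile"`
        if [ -n "$pid" ]; then
            # "ps -p <pid> -o comm=" prints only the command name of that
            # PID, or nothing at all if no such process exists.
            comm=`ps -p $pid -o comm= 2>/dev/null`
            case "$comm" in
                crm_mon) exit $OCF_SUCCESS;;      # PID exists and is crm_mon
                "")      exit $OCF_NOT_RUNNING;;  # no such PID
                *)       exit $OCF_NOT_RUNNING;;  # PID reused by another process
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

With a check like this, a PID that survives a reboot but now belongs to a different daemon is treated as "not running" instead of being reported as a healthy crm_mon.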
*** Bug 1385753 has been marked as a duplicate of this bug. ***
Fixed upstream by commit 7b303943
I have verified that the ClusterMon resource is correctly recognized as failed in pacemaker-1.1.16-9.

---

Common setup:
* configure cluster with fencing and ClusterMon resource [1]

before the fix (pacemaker-1.1.15-11.el7.x86_64)
===============================================

[root@virt-136 ~]# pcs resource
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
15327 pts/0    S+     0:00  \_ grep --color=auto crm_mon
15284 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

[root@virt-136 ~]# cat /tmp/ClusterMon_cmon.pid
15284

[root@virt-136 ~]# pcs node maintenance virt-136
[root@virt-136 ~]# pcs resource
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136 (unmanaged)

[root@virt-136 ~]# kill -9 15284
[root@virt-136 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-136 ~]# pcs node unmaintenance virt-136

[root@virt-136 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-136 ~]# pcs resource
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
15546 pts/0    S+     0:00  \_ grep --color=auto crm_mon

after the fix (pacemaker-1.1.16-9.el7.x86_64)
=============================================

[root@virt-136 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
10637 pts/0    S+     0:00  \_ grep --color=auto crm_mon
10570 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

[root@virt-136 ~]# cat /tmp/ClusterMon_cmon.pid
10570

[root@virt-136 ~]# pcs node maintenance virt-136
[root@virt-136 ~]# kill -9 10570
[root@virt-136 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-136 ~]# pcs node unmaintenance virt-136

[root@virt-136 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-136 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
10783 pts/0    S+     0:00  \_ grep --color=auto crm_mon
10743 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

-----
(1) pcs config

[root@virt-136 ~]# pcs config
Cluster Name: STSRHTS2420
Corosync Nodes:
 virt-134 virt-135 virt-136
Pacemaker Nodes:
 virt-134 virt-135 virt-136

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Resource: cmon (class=ocf provider=pacemaker type=ClusterMon)
  Operations: monitor interval=10 timeout=20 (cmon-monitor-interval-10)
              start interval=0s timeout=20 (cmon-start-interval-0s)
              stop interval=0s timeout=20 (cmon-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-134 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-134 pcmk_host_map=virt-134:virt-134.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-134-monitor-interval-60s)
 Resource: fence-virt-135 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-135 pcmk_host_map=virt-135:virt-135.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-135-monitor-interval-60s)
 Resource: fence-virt-136 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-136 pcmk_host_map=virt-136:virt-136.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-136-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: cmon
    Enabled on: virt-136 (score:INFINITY) (id:location-cmon-virt-136-INFINITY)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2420
 dc-version: 1.1.15-11.el7-e174ec8
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:
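The manual verification above can also be collected into a small script. This is only an illustrative sketch, assuming the node name virt-136, the resource name cmon, and the pidfile path shown in the transcript:

#!/bin/sh
# Sketch of the manual verification steps above. Assumes the ClusterMon
# resource is named "cmon", runs on node virt-136, and uses the pidfile
# shown in the transcript. Run as root on that node.
NODE=virt-136
PIDFILE=/tmp/ClusterMon_cmon.pid

pcs node maintenance "$NODE"        # resource becomes unmanaged
kill -9 `cat $PIDFILE`              # kill the real crm_mon daemon
echo 1 > "$PIDFILE"                 # point the pidfile at PID 1, which always exists
pcs node unmaintenance "$NODE"

# Before the fix the monitor kept returning 0 and no crm_mon was left
# running; after the fix the failure is detected and crm_mon is restarted,
# so ps shows a fresh /usr/sbin/crm_mon process (as in the transcript).
pcs resource debug-monitor cmon
ps axf | grep crm_mon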
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1862