Bug 1378817

Summary: ClusterMon will not kill crm_mon process correctly
Product: Red Hat Enterprise Linux 7
Reporter: Oyvind Albrigtsen <oalbrigt>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.2
CC: abeekhof, agk, cfeist, cluster-maint, cluster-qe, fdinitto, kgaillot, mnovacek, pzimek
Target Milestone: rc
Target Release: 7.4
Hardware: All
OS: All
Whiteboard:
Fixed In Version: pacemaker-1.1.16-1.el7
Doc Type: No Doc Update
Doc Text:
This rare and minor issue was not reported by a customer and does not need to be in the 7.4 release notes.
Story Points: ---
Clone Of: 1360234
Environment:
Last Closed: 2017-08-01 17:54:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1360234, 1385753
Bug Blocks:

Description Oyvind Albrigtsen 2016-09-23 10:22:29 UTC
+++ This bug was initially created as a clone of Bug #1360234 +++

Description of problem:


After putting the cluster in maintenance mode, rebooting the nodes and then clearing the maintenance state, the ClusterMon resource sometimes keeps reporting as running even though "crm_mon" is NOT running. Normally ClusterMon should detect the missing "crm_mon" in a reliable way and report the failure.


The mechanism used by the "ClusterMon" script shipped in "/usr/lib/ocf/resource.d/pacemaker/ClusterMon" is as follows:

The script checks the content of the PID file created by "crm_mon" and tries to send a signal ("kill -s 0 <PID>") to that PID.

If any process is running with this PID, the resource agent reports that "crm_mon" is running. But it can be any process that happens to have this PID (for example, after a reboot of the machine it could be another daemon process). In that case the value reported by the agent is wrong.

Current implementation:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1; rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}


So sometimes the "kill -s 0 $pid" command returns success even though "crm_mon" is not running.
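
For illustration (an added sketch, not part of the original report): "kill -s 0" only verifies that some process with the given PID exists and may be signalled, so any recycled PID passes the check:

# "kill -s 0" succeeds for ANY live PID the caller is allowed to signal,
# e.g. PID 1, which always exists:
if kill -s 0 1 2>/dev/null; then
    echo "PID 1 is alive -- but it is certainly not crm_mon"
fi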



Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.4.x86_64


How reproducible:
Reproducibility depends on which processes are running and which PIDs they happen to get.

Steps to Reproduce:
1. Configure a ClusterMon resource
2. Put the node into maintenance mode
3. Reboot the node
4. Arrange for the PID that was recorded for the original crm_mon process to be reused by another process (or retry a few times until it happens by chance); see the sketch after this list
5. Take the node out of maintenance mode
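
A minimal way to force step 4 (a sketch based on the verification transcript in comment 5 below; the node name, resource name and pidfile path are taken from there):

pcs node maintenance virt-136                 # unmanage the resources on the node
kill -9 "$(cat /tmp/ClusterMon_cmon.pid)"     # stop crm_mon behind Pacemaker's back
echo 1 > /tmp/ClusterMon_cmon.pid             # simulate a recycled PID (PID 1 always exists)
pcs node unmaintenance virt-136
pcs resource debug-monitor cmon               # unfixed agent still returns 0 (success)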


Actual results:
If the PID of crm_mon changes after the reboot, ClusterMon_monitor still checks the stale PID from the PID file and the agent can end up killing an unrelated process instead.


Expected results:
ClusterMon_monitor checks the PID of the crm_mon process that is actually running (not the stale one from before the reboot).



Additional info:

The customer proposed the following patch:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1 \
                  && ps -fp $pid | grep $OCF_RESKEY_pidfile >/dev/null 2>&1
            rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

This filters out the case where the process with the reported PID is not the "crm_mon" that wrote the PID file.

(but on stop it might still kill an unrelated process)
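
For comparison, an illustrative sketch (not the actual upstream patch; see the pull request referenced in the following comment) of a monitor check that also verifies the process command line, so a recycled PID belonging to an unrelated process is not treated as a running crm_mon:

ClusterMon_monitor() {
    if [ -f "$OCF_RESKEY_pidfile" ]; then
        pid=$(cat "$OCF_RESKEY_pidfile")
        if [ -n "$pid" ]; then
            # Only report success if the PID exists AND its command line is a
            # crm_mon started with this resource's pidfile.
            if ps -p "$pid" -o args= 2>/dev/null | \
                    grep "crm_mon.*$OCF_RESKEY_pidfile" >/dev/null 2>&1; then
                exit $OCF_SUCCESS
            fi
            exit $OCF_NOT_RUNNING
        fi
    fi
    exit $OCF_NOT_RUNNING
}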

--- Additional comment from Oyvind Albrigtsen on 2016-09-23 12:11:09 CEST ---

Tested and working patch: https://github.com/ClusterLabs/pacemaker/pull/1147

Comment 1 Ken Gaillot 2016-10-17 16:01:17 UTC
*** Bug 1385753 has been marked as a duplicate of this bug. ***

Comment 2 Ken Gaillot 2016-10-17 16:03:04 UTC
Fixed upstream by commit 7b303943

Comment 5 michal novacek 2017-05-24 07:09:29 UTC
I have verified that the ClusterMon resource is correctly recognized as
failed in pacemaker-1.1.16-9.

---


Common setup:
    * configure cluster with fencing and ClusterMon resource [1]

before the fix (pacemaker-1.1.15-11.el7.x86_64)
===============================================
[root@virt-136 ~]# pcs resource 
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
15327 pts/0    S+     0:00          \_ grep --color=auto crm_mon
15284 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

[root@virt-136 ~]# cat /tmp/ClusterMon_cmon.pid 
     15284
[root@virt-136 ~]# pcs node maintenance virt-136

[root@virt-136 ~]# pcs resource 
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136 (unmanaged)


[root@virt-136 ~]# kill -9 15284
[root@virt-136 ~]# echo 1 > /tmp/ClusterMon_cmon.pid 

[root@virt-136 ~]# pcs node unmaintenance virt-136

[root@virt-136 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-136 ~]# pcs resource
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
15546 pts/0    S+     0:00          \_ grep --color=auto crm_mon


after the fix (pacemaker-1.1.16-9.el7.x86_64)
=============================================

[root@virt-136 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
10637 pts/0    S+     0:00          \_ grep --color=auto crm_mon
10570 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

[root@virt-136 ~]# cat /tmp/ClusterMon_cmon.pid 
     10570

[root@virt-136 ~]# pcs node maintenance virt-136
[root@virt-136 ~]# kill -9 10570
[root@virt-136 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-136 ~]# pcs node unmaintenance virt-136

[root@virt-136 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-136 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
10783 pts/0    S+     0:00          \_ grep --color=auto crm_mon
10743 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

-----

(1) pcs config

[root@virt-136 ~]# pcs config
Cluster Name: STSRHTS2420
Corosync Nodes:
 virt-134 virt-135 virt-136
Pacemaker Nodes:
 virt-134 virt-135 virt-136

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Resource: cmon (class=ocf provider=pacemaker type=ClusterMon)
  Operations: monitor interval=10 timeout=20 (cmon-monitor-interval-10)
              start interval=0s timeout=20 (cmon-start-interval-0s)
              stop interval=0s timeout=20 (cmon-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-134 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-134 pcmk_host_map=virt-134:virt-134.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-134-monitor-interval-60s)
 Resource: fence-virt-135 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-135 pcmk_host_map=virt-135:virt-135.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-135-monitor-interval-60s)
 Resource: fence-virt-136 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-136 pcmk_host_map=virt-136:virt-136.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-136-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: cmon
    Enabled on: virt-136 (score:INFINITY) (id:location-cmon-virt-136-INFINITY)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2420
 dc-version: 1.1.15-11.el7-e174ec8
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:

Comment 6 errata-xmlrpc 2017-08-01 17:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862