+++ This bug was initially created as a clone of Bug #1360234 +++

Description of problem:
After putting the cluster into maintenance mode, rebooting the nodes, and then clearing the maintenance status, the ClusterMon resource is sometimes reported as running even though "crm_mon" is NOT running. ClusterMon should detect a missing "crm_mon" reliably so that the node can be reported as "bad".

Mechanism used by the "ClusterMon" script shipped as "/usr/lib/ocf/resource.d/pacemaker/ClusterMon": the script reads the PID from the PID file created by "crm_mon" and sends a signal ("kill -s 0 <PID>") to that PID. If any process with this PID exists, the resource agent reports that "crm_mon" is running. But after a reboot of the machine this can be a completely different process (e.g. another daemon), in which case the value reported by the agent is wrong.

Actual implementation:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1; rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

So "kill -s 0 $pid" sometimes returns success even though "crm_mon" is not running.

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.4.x86_64

How reproducible:
Depends on the processes actually running on the node and which PIDs they happen to get.

Steps to Reproduce:
1. Configure a ClusterMon resource.
2. Put the node into maintenance mode.
3. Reboot the node.
4. Either find a way to force the PID originally assigned to crm_mon onto another process, or repeat the test a few times until it happens by chance.
5. Disable maintenance mode.

Actual results:
If the PID of crm_mon changes after the reboot, ClusterMon_monitor still checks the old PID, and the agent can kill an unrelated process instead.

Expected results:
ClusterMon_monitor checks the PID of the crm_mon that is actually running, not the stale one.

Additional info:
Customer proposed the following patch:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1 \
                && ps -fp $pid | grep $OCF_RESKEY_pidfile >/dev/null 2>&1
            rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

This filters out the case where the process behind the reported PID is not the crm_mon started for this resource (though it might still kill an unrelated process).

--- Additional comment from Oyvind Albrigtsen on 2016-09-23 12:11:09 CEST ---

Tested and working patch: https://github.com/ClusterLabs/pacemaker/pull/1147
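For illustration only (this is neither the customer patch above nor the upstream pull request), here is a minimal sketch of a monitor check that also verifies which command owns the PID. The function name ClusterMon_monitor_sketch is made up, and it assumes $OCF_RESKEY_pidfile and the OCF exit codes are already set up exactly as in the agent above:

# Illustrative sketch only: besides checking that the PID exists, confirm
# that it actually belongs to a crm_mon process before reporting success.
# Assumes $OCF_RESKEY_pidfile and the OCF exit codes are defined, as above.
ClusterMon_monitor_sketch() {
    if [ -f "$OCF_RESKEY_pidfile" ]; then
        pid=`cat "$OCF_RESKEY_pidfile"`
        if [ -n "$pid" ]; then
            # "ps -p <pid> -o comm=" prints only the command name of that
            # PID, or nothing at all if no such process exists.
            comm=`ps -p $pid -o comm= 2>/dev/null`
            case "$comm" in
                crm_mon) exit $OCF_SUCCESS;;      # PID exists and is crm_mon
                "")      exit $OCF_NOT_RUNNING;;  # no such PID
                *)       exit $OCF_NOT_RUNNING;;  # PID reused by another process
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

With a check like this, a PID that survives a reboot but now belongs to a different daemon is treated as "not running" instead of being reported as a healthy crm_mon.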
*** Bug 1385753 has been marked as a duplicate of this bug. ***
Fixed upstream by commit 7b303943
I have verified that the ClusterMon resource is correctly recognized as failed in pacemaker-1.1.16-9.

---

Common setup:
* configure cluster with fencing and ClusterMon resource [1]

before the fix (pacemaker-1.1.15-11.el7.x86_64)
===============================================

[root@virt-136 ~]# pcs resource
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
15327 pts/0    S+     0:00  \_ grep --color=auto crm_mon
15284 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

[root@virt-136 ~]# cat /tmp/ClusterMon_cmon.pid
15284

[root@virt-136 ~]# pcs node maintenance virt-136
[root@virt-136 ~]# pcs resource
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136 (unmanaged)

[root@virt-136 ~]# kill -9 15284
[root@virt-136 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-136 ~]# pcs node unmaintenance virt-136

[root@virt-136 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-136 ~]# pcs resource
...
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
15546 pts/0    S+     0:00  \_ grep --color=auto crm_mon

after the fix (pacemaker-1.1.16-9.el7.x86_64)
=============================================

[root@virt-136 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
10637 pts/0    S+     0:00  \_ grep --color=auto crm_mon
10570 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

[root@virt-136 ~]# cat /tmp/ClusterMon_cmon.pid
10570

[root@virt-136 ~]# pcs node maintenance virt-136
[root@virt-136 ~]# kill -9 10570
[root@virt-136 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-136 ~]# pcs node unmaintenance virt-136

[root@virt-136 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-136 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-136

[root@virt-136 ~]# ps axf | grep crm_mon
10783 pts/0    S+     0:00  \_ grep --color=auto crm_mon
10743 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

-----
(1) pcs config

[root@virt-136 ~]# pcs config
Cluster Name: STSRHTS2420
Corosync Nodes:
 virt-134 virt-135 virt-136
Pacemaker Nodes:
 virt-134 virt-135 virt-136

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Resource: cmon (class=ocf provider=pacemaker type=ClusterMon)
  Operations: monitor interval=10 timeout=20 (cmon-monitor-interval-10)
              start interval=0s timeout=20 (cmon-start-interval-0s)
              stop interval=0s timeout=20 (cmon-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-134 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-134 pcmk_host_map=virt-134:virt-134.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-134-monitor-interval-60s)
 Resource: fence-virt-135 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-135 pcmk_host_map=virt-135:virt-135.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-135-monitor-interval-60s)
 Resource: fence-virt-136 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-136 pcmk_host_map=virt-136:virt-136.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-136-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: cmon
    Enabled on: virt-136 (score:INFINITY) (id:location-cmon-virt-136-INFINITY)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS2420
 dc-version: 1.1.15-11.el7-e174ec8
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:
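The manual verification above can also be collected into a small script. This is only an illustrative sketch, assuming the node name virt-136, the resource name cmon, and the pidfile path shown in the transcript:

#!/bin/sh
# Sketch of the manual verification steps above. Assumes the ClusterMon
# resource is named "cmon", runs on node virt-136, and uses the pidfile
# shown in the transcript. Run as root on that node.
NODE=virt-136
PIDFILE=/tmp/ClusterMon_cmon.pid

pcs node maintenance "$NODE"        # resource becomes unmanaged
kill -9 `cat $PIDFILE`              # kill the real crm_mon daemon
echo 1 > "$PIDFILE"                 # point the pidfile at PID 1, which always exists
pcs node unmaintenance "$NODE"

# Before the fix the monitor kept returning 0 and no crm_mon was left
# running; after the fix the failure is detected and crm_mon is restarted,
# so ps shows a fresh /usr/sbin/crm_mon process (as in the transcript).
pcs resource debug-monitor cmon
ps axf | grep crm_mon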
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1862