Bug 1360234

Summary: ClusterMon will not kill crm_mon process correctly
Product: Red Hat Enterprise Linux 6
Reporter: Josef Zimek <pzimek>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 6.7
CC: abeekhof, agk, cfeist, cluster-maint, fdinitto, jkortus, mnovacek
Target Milestone: rc
Target Release: 6.9
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.15-3.el6
Last Closed: 2017-03-21 09:52:10 UTC
Type: Bug
Bug Blocks: 1378817, 1385753

Description Josef Zimek 2016-07-26 10:07:49 UTC
Description of problem:


After putting the cluster into maintenance mode, rebooting the nodes, and then resetting the maintenance status, the ClusterMon resource sometimes keeps being reported as running even though "crm_mon" is NOT running. Normally ClusterMon should reliably detect the missing "crm_mon" and report the node as "bad".


The mechanism used by the "ClusterMon" script delivered in "/usr/lib/ocf/resource.d/pacemaker/ClusterMon":

The script reads the PID from the PID file created by "crm_mon" and sends signal 0 ("kill -s 0 <PID>") to that PID.

If any process is running with this PID, the resource agent script reports that "crm_mon" is running. But this can be any process that happens to have this PID (e.g. after a reboot of the machine it could be another daemon process).
In this case the value reported by the agent is wrong.

Current implementation:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1; rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}


So the "kill -s 0 $pid" command sometimes returns success even though "crm_mon" is not running.
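
To illustrate the false positive without a cluster, here is a minimal sketch. It only mirrors the logic quoted above: "naive_monitor" is a hypothetical name, the OCF_* values are defined locally as stand-ins (the real agent takes them from the OCF shell functions), and the function returns instead of exiting so it can be run in a plain shell. The stale PID file is faked by writing the PID of the current shell into a temporary file.

```shell
#!/bin/sh
# Stand-in OCF return codes (assumption: defined here only for the demo).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# naive_monitor PIDFILE  (hypothetical helper, same logic as ClusterMon_monitor)
# Reports success if ANY process owns the PID stored in PIDFILE.
naive_monitor() {
    nm_pidfile=$1
    if [ -f "$nm_pidfile" ]; then
        # Strip whitespace: crm_mon pads the PID with leading spaces.
        nm_pid=$(tr -d '[:space:]' < "$nm_pidfile")
        if [ -n "$nm_pid" ]; then
            nm_rc=0
            kill -s 0 "$nm_pid" >/dev/null 2>&1 || nm_rc=$?
            case $nm_rc in
                0) return $OCF_SUCCESS;;
                1) return $OCF_NOT_RUNNING;;
                *) return $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    return $OCF_NOT_RUNNING
}

# Simulate a stale PID file: it holds the PID of this shell, not of crm_mon.
stale_pidfile=$(mktemp)
echo "$$" > "$stale_pidfile"

result=0
naive_monitor "$stale_pidfile" || result=$?
echo "naive monitor returned $result"   # 0, although no crm_mon is involved

rm -f "$stale_pidfile"
```

The check cannot tell "crm_mon" apart from whatever process currently owns the recorded PID, which is exactly the situation after a reboot.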



Version-Release number of selected component (if applicable):
pacemaker.x86_64 - 1.1.12-8.el6_7.2


How reproducible:
The problem depends on which processes are running at the time and on their PIDs.

Steps to Reproduce:
1. Configure a ClusterMon resource
2. Put the node into maintenance mode
3. Reboot the node
4. Either find a way to force the PID originally assigned to crm_mon onto another process, or retry a few times until it happens by chance
5. Disable maintenance mode


Actual results:
If the PID of crm_mon changes after the reboot, ClusterMon_monitor still checks the old PID, and the agent can kill another process instead.


Expected results:
ClusterMon_monitor checks the PID of the actual crm_mon process (not the old one).



Additional info:

The customer proposed the following patch:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1 \
                  && ps -fp $pid | grep $OCF_RESKEY_pidfile >/dev/null 2>&1
            rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}


With this change the monitor reports success only if the process with the recorded PID is actually "crm_mon" (its command line contains the PID file path).

(But the agent might still kill another process.)
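
A minimal sketch of the same idea, for illustration only: it is neither the customer's patch nor the upstream fix. It assumes a Linux "/proc" filesystem and reads the command line of the recorded PID directly instead of parsing "ps" output. "strict_monitor" and its two arguments are hypothetical, and the OCF_* values are defined locally as stand-ins.

```shell
#!/bin/sh
# Stand-in OCF return codes (assumption: defined here only for the demo).
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# strict_monitor PIDFILE NAME  (hypothetical helper)
# Succeeds only if the process recorded in PIDFILE has a command line that
# contains NAME followed by the PIDFILE path (as in "crm_mon -p <pidfile>").
strict_monitor() {
    sm_pidfile=$1
    sm_name=$2
    [ -f "$sm_pidfile" ] || return $OCF_NOT_RUNNING
    sm_pid=$(tr -d '[:space:]' < "$sm_pidfile")
    # Accept only a non-empty, purely numeric PID.
    case $sm_pid in
        ''|*[!0-9]*) return $OCF_NOT_RUNNING;;
    esac
    [ -r "/proc/$sm_pid/cmdline" ] || return $OCF_NOT_RUNNING
    # /proc/<pid>/cmdline is NUL-separated; make it one space-separated line.
    sm_cmdline=$(tr '\000' ' ' < "/proc/$sm_pid/cmdline")
    case $sm_cmdline in
        *"$sm_name"*"$sm_pidfile"*) return $OCF_SUCCESS;;
        *) return $OCF_NOT_RUNNING;;
    esac
}

# A stale PID file that points at this shell is no longer mistaken for crm_mon.
stale_pidfile=$(mktemp)
echo "$$" > "$stale_pidfile"

result=0
strict_monitor "$stale_pidfile" crm_mon || result=$?
echo "strict monitor returned $result"   # 7 (not running)

rm -f "$stale_pidfile"
```

Matching on both the program name and the PID file path makes it less likely that an unrelated process passes the check. The same verification would also be needed before the stop action sends a signal, otherwise the "might still kill another process" concern remains.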

Comment 4 Oyvind Albrigtsen 2016-09-23 10:11:09 UTC
Tested and working patch: https://github.com/ClusterLabs/pacemaker/pull/1147

Comment 9 michal novacek 2017-01-16 14:22:24 UTC
I have verified that the ClusterMon resource is correctly recognized as failed
when crm_mon is not running, in pacemaker-1.1.15-4

---

Have a running pacemaker cluster configured (1).

before the fix pacemaker-1.1.14-8.el6_8.2.x86_64
================================================
crm_mon is recognized as running even though it is not

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157

[root@virt-157 ~]# ps axf | grep crm_mon
10381 pts/0    S+     0:00          \_ grep crm_mon
10337 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html
[root@virt-157 ~]# cat /tmp/ClusterMon_cmon.pid 
     10337
[root@virt-157 ~]# pcs node maintenance virt-157
[root@virt-157 ~]# kill -9 10337
[root@virt-157 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-157 ~]# pcs node unmaintenance virt-157

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157

[root@virt-157 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-157 ~]# ps axf | grep crm_mon
 3273 pts/0    S+     0:00          \_ grep crm_mon

after the patch pacemaker-1.1.15-4.el6.x86_64
=============================================

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157
[root@virt-157 ~]# ps axf | grep crm_mon
 4887 pts/1    S+     0:00          \_ grep crm_mon
 4539 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html
[root@virt-157 ~]# cat /tmp/ClusterMon_cmon.pid
      4539

[root@virt-157 ~]# pcs node maintenance virt-157
[root@virt-157 ~]# kill -9 4539
[root@virt-157 ~]# echo 1 > /tmp/ClusterMon_cmon.pid 
[root@virt-157 ~]# pcs node unmaintenance virt-157

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157

[root@virt-157 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-157 ~]# ps axf | grep crm_mon
 5327 pts/1    S+     0:00          \_ grep crm_mon
 5102 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

 --

 > (1) pcs config
[root@virt-157 ~]# pcs config
Cluster Name: STSRHTS14109
Corosync Nodes:
 virt-157 virt-159
Pacemaker Nodes:
 virt-157 virt-159

Resources:
 Resource: cmon (class=ocf provider=pacemaker type=ClusterMon)
  Operations: start interval=0s timeout=20 (cmon-start-interval-0s)
              stop interval=0s timeout=20 (cmon-stop-interval-0s)
              monitor interval=10 timeout=20 (cmon-monitor-interval-10)

Stonith Devices:
 Resource: fence-virt-157 (class=stonith type=fence_xvm)
  Attributes: delay=5 pcmk_host_check=static-list pcmk_host_list=virt-157 pcmk_host_map=virt-157:virt-157.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-157-monitor-interval-60s)
 Resource: fence-virt-159 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-159 pcmk_host_map=virt-159:virt-159.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-159-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: cmon
    Enabled on: virt-157 (score:INFINITY) (role: Started) (id:cli-prefer-cmon)
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.15-4.el6-e174ec8
 have-watchdog: false
 last-lrm-refresh: 1484574749

Comment 11 errata-xmlrpc 2017-03-21 09:52:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0629.html