Bug 1360234
| Summary: | ClusterMon will not kill crm_mon process correctly | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Josef Zimek <pzimek> | |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 6.7 | CC: | abeekhof, agk, cfeist, cluster-maint, fdinitto, jkortus, mnovacek | |
| Target Milestone: | rc | |||
| Target Release: | 6.9 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | pacemaker-1.1.15-3.el6 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1378817 1385753 | Environment: | ||
| Last Closed: | 2017-03-21 09:52:10 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1378817, 1385753 | |||
Tested and working patch: https://github.com/ClusterLabs/pacemaker/pull/1147
I have verified that a dead crm_mon is now correctly detected as failed by the
ClusterMon resource agent in pacemaker-1.1.15-4
---
Have a running pacemaker cluster configured (1).
before the fix pacemaker-1.1.14-8.el6_8.2.x86_64
================================================
crm_mon is recognized as running even though it is not
[root@virt-157 ~]# pcs resource
cmon (ocf::pacemaker:ClusterMon): Started virt-157
[root@virt-157 ~]# ps axf | grep crm_mon
10381 pts/0 S+ 0:00 \_ grep crm_mon
10337 ? S 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html
[root@virt-157 ~]# cat /tmp/ClusterMon_cmon.pid
10337
[root@virt-157 ~]# pcs node maintenance virt-157
[root@virt-157 ~]# kill -9 10337
[root@virt-157 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-157 ~]# pcs node unmaintenance virt-157
[root@virt-157 ~]# pcs resource
cmon (ocf::pacemaker:ClusterMon): Started virt-157
[root@virt-157 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0
[root@virt-157 ~]# ps axf | grep crm_mon
3273 pts/0 S+ 0:00 \_ grep crm_mon
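The false positive above can also be seen without a cluster. The pre-fix monitor only sends signal 0 to whatever PID is in the pidfile, so any live PID passes. Below is a minimal sketch of that behaviour; `pid_alive`, the `sleep` stand-in for crm_mon, and the temporary pidfile are assumptions for illustration and are not part of the agent. The transcript writes PID 1 as root; the sketch writes the shell's own PID so that it also works unprivileged.

```shell
#!/bin/sh
# Pre-fix style liveness test: true for ANY live process we may signal,
# not only for crm_mon. (Hypothetical helper name.)
pid_alive() {
    kill -s 0 "$1" >/dev/null 2>&1
}

pidfile=$(mktemp)

# Stand-in for crm_mon (assumption): a background sleep writes its PID.
sleep 60 &
echo $! > "$pidfile"

# Kill the "daemon", as done with kill -9 in the transcript.
old_pid=$(cat "$pidfile")
kill -9 "$old_pid"
wait "$old_pid" 2>/dev/null || true

# The stale pidfile now names an unrelated live process (this shell).
echo $$ > "$pidfile"

if pid_alive "$(cat "$pidfile")"; then
    result="running"
else
    result="not running"
fi
echo "old-style check says: $result"   # prints "running"

rm -f "$pidfile"
```

The stand-in daemon is gone, yet the check still reports "running", which matches the `returned 0` from debug-monitor above.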
after the patch pacemaker-1.1.15-4.el6.x86_64
=============================================
[root@virt-157 ~]# pcs resource
cmon (ocf::pacemaker:ClusterMon): Started virt-157
[root@virt-157 ~]# ps axf | grep crm_mon
4887 pts/1 S+ 0:00 \_ grep crm_mon
4539 ? S 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html
[root@virt-157 ~]# cat /tmp/ClusterMon_cmon.pid
4539
[root@virt-157 ~]# pcs node maintenance virt-157
[root@virt-157 ~]# kill -9 4539
[root@virt-157 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-157 ~]# pcs node unmaintenance virt-157
[root@virt-157 ~]# pcs resource
cmon (ocf::pacemaker:ClusterMon): Started virt-157
[root@virt-157 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0
[root@virt-157 ~]# ps axf | grep crm_mon
5327 pts/1 S+ 0:00 \_ grep crm_mon
5102 ? S 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html
--
> (1) pcs config
[root@virt-157 ~]# pcs config
Cluster Name: STSRHTS14109
Corosync Nodes:
virt-157 virt-159
Pacemaker Nodes:
virt-157 virt-159
Resources:
Resource: cmon (class=ocf provider=pacemaker type=ClusterMon)
Operations: start interval=0s timeout=20 (cmon-start-interval-0s)
stop interval=0s timeout=20 (cmon-stop-interval-0s)
monitor interval=10 timeout=20 (cmon-monitor-interval-10)
Stonith Devices:
Resource: fence-virt-157 (class=stonith type=fence_xvm)
Attributes: delay=5 pcmk_host_check=static-list pcmk_host_list=virt-157 pcmk_host_map=virt-157:virt-157.cluster-qe.lab.eng.brq.redhat.com
Operations: monitor interval=60s (fence-virt-157-monitor-interval-60s)
Resource: fence-virt-159 (class=stonith type=fence_xvm)
Attributes: pcmk_host_check=static-list pcmk_host_list=virt-159 pcmk_host_map=virt-159:virt-159.cluster-qe.lab.eng.brq.redhat.com
Operations: monitor interval=60s (fence-virt-159-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Resource: cmon
Enabled on: virt-157 (score:INFINITY) (role: Started) (id:cli-prefer-cmon)
Ordering Constraints:
Colocation Constraints:
Resources Defaults:
No defaults set
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.15-4.el6-e174ec8
have-watchdog: false
last-lrm-refresh: 1484574749
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0629.html
Description of problem:
After putting the cluster into maintenance mode, rebooting the nodes, and then resetting the maintenance status, the ClusterMon resource is sometimes reported as running even though "crm_mon" is NOT running. ClusterMon should reliably detect the missing "crm_mon" and report the node as "bad".

The mechanism used by the "ClusterMon" script delivered in "/usr/lib/ocf/resource.d/pacemaker/ClusterMon":
The script reads the PID from the PID file created by "crm_mon" and sends signal 0 ("kill -s 0 <PID>") to that PID. If any process is running with this PID, the resource agent reports that "crm_mon" is running. But this can be any process that happens to have this PID (e.g. after a reboot of the machine it could be another daemon process). In that case the value reported by the agent is wrong.

Actual implementation:

    ClusterMon_monitor() {
        if [ -f $OCF_RESKEY_pidfile ]; then
            pid=`cat $OCF_RESKEY_pidfile`
            if [ ! -z $pid ]; then
                kill -s 0 $pid >/dev/null 2>&1; rc=$?
                case $rc in
                    0) exit $OCF_SUCCESS;;
                    1) exit $OCF_NOT_RUNNING;;
                    *) exit $OCF_ERR_GENERIC;;
                esac
            fi
        fi
        exit $OCF_NOT_RUNNING
    }

So sometimes the "kill -s 0 $pid" command returns "success" even if "crm_mon" is not running.

Version-Release number of selected component (if applicable):
pacemaker.x86_64 - 1.1.12-8.el6_7.2

How reproducible:
The problem depends on which processes are actually running and on their PIDs.

Steps to Reproduce:
1. Set up a ClusterMon resource
2. Put the node into maintenance mode
3. Reboot the node
4. Either find a way to force the PID originally assigned to crm_mon onto another process, or retry a few times until it happens randomly
5. Disable maintenance mode

Actual results:
If the PID of crm_mon changes after the reboot, ClusterMon_monitor still checks the old PID, and the agent can kill another process instead.

Expected results:
ClusterMon_monitor checks the PID of the actual crm_mon (not the old one).

Additional info:
Customer proposed the following patch:

    ClusterMon_monitor() {
        if [ -f $OCF_RESKEY_pidfile ]; then
            pid=`cat $OCF_RESKEY_pidfile`
            if [ ! -z $pid ]; then
                kill -s 0 $pid >/dev/null 2>&1 \
                    && ps -fp $pid | grep $OCF_RESKEY_pidfile >/dev/null 2>&1
                rc=$?
                case $rc in
                    0) exit $OCF_SUCCESS;;
                    1) exit $OCF_NOT_RUNNING;;
                    *) exit $OCF_ERR_GENERIC;;
                esac
            fi
        fi
        exit $OCF_NOT_RUNNING
    }

This checks whether the process with the reported PID is actually "crm_mon" (but the agent still might kill another process).
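The idea behind the proposed patch (confirm that the PID really belongs to the expected process, not just that it exists) can be sketched as a small standalone helper. This is only an illustration under stated assumptions, not the fix that was merged upstream: `pidfile_matches` and its `pm_*` variables are hypothetical names, and the check reads the Linux-specific `/proc/<pid>/cmdline` rather than parsing `ps` output.

```shell
#!/bin/sh
# Hypothetical helper (not part of the shipped agent): succeed only if the
# PID stored in pidfile $1 is alive AND its command line contains string $2.
pidfile_matches() {
    pm_pidfile=$1
    pm_pattern=$2

    [ -f "$pm_pidfile" ] || return 1
    pm_pid=$(cat "$pm_pidfile")

    # Reject empty or non-numeric pidfile content.
    case $pm_pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Signal 0 only proves that some process with this PID exists.
    kill -s 0 "$pm_pid" 2>/dev/null || return 1

    # /proc/<pid>/cmdline is NUL-separated (Linux only); make it printable.
    pm_cmdline=$(tr '\0' ' ' 2>/dev/null < "/proc/$pm_pid/cmdline") || return 1

    # Accept the PID only if its command line mentions the expected string.
    case $pm_cmdline in
        *"$pm_pattern"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Since crm_mon is started with `-p <pidfile>`, a monitor action could call `pidfile_matches "$OCF_RESKEY_pidfile" "$OCF_RESKEY_pidfile"` and map success to `$OCF_SUCCESS` and failure to `$OCF_NOT_RUNNING`, which mirrors the `grep $OCF_RESKEY_pidfile` in the proposed patch. As with that patch, this only hardens the monitor check and does not by itself protect the stop path.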