Bug 1263348

Summary: tmpfile leak in mysql resource agent
Product: Red Hat Enterprise Linux 7
Component: resource-agents
Version: 7.1
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Frank Enderle <frank.enderle>
Assignee: Oyvind Albrigtsen <oalbrigt>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, agk, cfeist, cluster-maint, frank.enderle, mnovacek, oalbrigt, rmccabe
Fixed In Version: resource-agents-3.9.5-61.el7
Doc Type: Bug Fix
Clones: 1292054
Bug Blocks: 1292054
Type: Bug
Last Closed: 2016-11-03 23:58:32 UTC

Description Frank Enderle 2015-09-15 15:28:54 UTC
Description of problem:
The mysql resource agent leaks its check_slave.* temporary file, so check_slave.* files accumulate in the /var/run/resource-agents directory without ever being cleaned up, eventually filling the partition or exhausting its inodes.

Version-Release number of selected component (if applicable):
bash-4.2.46-12.el7.x86_64 (CentOS 7.1.1503)
resource-agents-3.9.5-40.el7_1.6.x86_64 (CentOS 7.1.1503)

How reproducible:
always

Steps to Reproduce:
1. Set up a mysql master/slave Pacemaker cluster.
2. Look at the /var/run/resource-agents directory.

Actual results:
Hundreds or thousands of leftover check_slave.* files in /var/run/resource-agents

Expected results:
None to a few check_slave.* files in /var/run/resource-agents
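Step 2 can be turned into a repeatable growth check. The sketch below is hypothetical (check_slave_growth and count_check_slave are not part of resource-agents): it counts check_slave.* files, waits, counts again, and returns non-zero only if the count grew. The 20-second default wait is an assumption chosen to exceed the 15-second OCF_CHECK_LEVEL=10 monitor interval used later in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of resource-agents.

count_check_slave() {
    # Count check_slave.* files at the top level of directory $1.
    # A missing directory yields 0 (find's error is discarded).
    find "$1" -maxdepth 1 -type f -name 'check_slave.*' 2>/dev/null | wc -l
}

check_slave_growth() {
    # $1: directory to watch; $2: seconds to wait between samples.
    # Defaults are assumptions: the directory from this report and a
    # wait longer than a 15 s monitor interval.
    local dir=${1:-/var/run/resource-agents} wait_secs=${2:-20}
    local before after
    before=$(count_check_slave "$dir")
    sleep "$wait_secs"
    after=$(count_check_slave "$dir")
    echo "check_slave.* files in $dir: before=$before after=$after"
    # Exit status 0 = no growth, 1 = the count grew (possible leak).
    [ "$after" -le "$before" ]
}

# Example, on the slave node with the OCF_CHECK_LEVEL=10 monitor active:
#   check_slave_growth /var/run/resource-agents 20 || echo "possible leak"
```

A single increase is not conclusive, because a monitor may have just created its file and not yet removed it. On an affected node the count keeps growing across repeated runs; on a fixed one it stays flat.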

Additional info:
I did some debugging, and the problem seems to boil down to a strange behaviour in /usr/lib/ocf/resource.d/heartbeat/mysql:

1. check_slave() creates the file by calling get_slave_info() (line 378).
2. The file should be deleted later on, and execution does reach line 504, which is supposed to delete it with rm -f $tmpfile.
3. Debugging shows that $tmpfile is empty when rm -f $tmpfile runs. The variable seems to be scoped to get_slave_info(), even though the function does not declare it 'local'.
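The symptom described above (a variable assigned without 'local' that is nonetheless empty in the caller) matches a well-known bash behavior: assignments made in a subshell, for example inside a $( ... ) command substitution, never reach the parent shell. The sketch below reproduces that mechanism in isolation. Whether the agent hits it in exactly this way is an assumption, and get_info, check_leaky and check_safe are hypothetical names. It also shows why the leak is silent: with an empty $tmpfile, rm -f $tmpfile expands to a bare rm -f, which exits 0 without removing anything.

```shell
#!/usr/bin/env bash
# Sketch only: get_info, check_leaky and check_safe are hypothetical,
# not functions from the mysql agent.
rsctmp=$(mktemp -d)   # stand-in for /var/run/resource-agents

get_info() {
    # No 'local': in the current shell this assignment would be global.
    tmpfile=$(mktemp "$rsctmp/check_slave.demo.XXXXXX")
    # The echo stands in for the real slave status query.
    echo "Slave_IO_Running: Yes" > "$tmpfile"
}

check_leaky() {
    tmpfile=""
    # Command substitution runs get_info in a subshell, so its
    # assignment to tmpfile is lost when the subshell exits.
    status=$(get_info; cat "$tmpfile")
    # $tmpfile is empty here, so this is a bare 'rm -f': exit 0,
    # no diagnostic, and the file created above is left behind.
    rm -f $tmpfile
}

check_safe() {
    local tmp
    # The function that creates the file is also the one that removes
    # it, so the name never has to cross a subshell boundary.
    tmp=$(mktemp "$rsctmp/check_slave.demo.XXXXXX") || return 1
    echo "Slave_IO_Running: Yes" > "$tmp"
    status=$(cat "$tmp")
    rm -f -- "$tmp"
}

for i in 1 2 3; do check_leaky; done
leaky_count=$(find "$rsctmp" -type f -name 'check_slave.*' | wc -l)
rm -f "$rsctmp"/check_slave.*

for i in 1 2 3; do check_safe; done
safe_count=$(find "$rsctmp" -type f -name 'check_slave.*' | wc -l)

echo "leaky: $leaky_count file(s) left, safe: $safe_count file(s) left"
rm -rf "$rsctmp"
```

Running it should report three leftover files for the leaky variant and none for the safe one.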

Comment 2 Fabio Massimo Di Nitto 2015-09-15 17:33:31 UTC
Looking at the code, there are many conditions inside the agent that can lead to temporary file leaks besides the normal runtime monitoring issue.

I have never used mysql in master/slave mode (where this code path is leaking). Can you please share your configuration so I can set it up easily to verify any fix?

Thanks

Comment 3 Frank Enderle 2015-10-02 16:40:35 UTC
I'm currently unable to produce a test case. I will try to find the time next week, so please do not auto-close this bug.

Comment 5 Oyvind Albrigtsen 2015-12-16 10:18:21 UTC
OCF_CHECK_LEVEL=10 is required to reproduce the issue:
# pcs resource op add MDB monitor interval=15s OCF_CHECK_LEVEL=10

I tested and verified that the following patch solves the issue:
https://github.com/ClusterLabs/resource-agents/pull/722

Comment 7 Oyvind Albrigtsen 2016-03-01 09:34:55 UTC
Setup:
# pcs resource create MySQL mysql
# pcs resource master MySQL
# pcs resource op add MySQL monitor interval=15s OCF_CHECK_LEVEL=10


Run on the slave node.

Before:
# rpm -q resource-agents
resource-agents-3.9.5-54.el7_2.6.x86_64
# ls /var/run/resource-agents/check_slave.*
/var/run/resource-agents/check_slave.MySQL.uCeh9f

Wait at least 15 seconds.
# ls /var/run/resource-agents/check_slave.*
/var/run/resource-agents/check_slave.MySQL.uCeh9f
/var/run/resource-agents/check_slave.MySQL.uxRvQo


After:
# rpm -q resource-agents
resource-agents-3.9.5-61.el7.x86_64
# ls /var/run/resource-agents/check_slave.*
/var/run/resource-agents/check_slave.MySQL.uCeh9f
/var/run/resource-agents/check_slave.MySQL.uxRvQo

Wait at least 15 seconds.
# ls /var/run/resource-agents/check_slave.*
/var/run/resource-agents/check_slave.MySQL.uCeh9f
/var/run/resource-agents/check_slave.MySQL.uxRvQo

The number of files stays the same: the files created before the patch still exist (they will be removed on reboot), but no new ones appear.
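The leftover files mentioned above persist until a reboot; on RHEL 7, /var/run is a symlink to /run, which is a tmpfs, so its contents do not survive a restart. If waiting for a reboot is not an option, a sketch like the one below could remove them by hand. cleanup_stale_check_slave is a hypothetical helper, and the 10-minute age threshold is an assumption meant to skip any file that a running monitor is still using. Only names matching check_slave.* are touched, so files such as master_status.mysql are left alone.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of resource-agents.
cleanup_stale_check_slave() {
    # $1: directory; $2: minimum age in minutes (both defaults assumed).
    local dir=${1:-/var/run/resource-agents} min_age=${2:-10}
    # -mmin +N matches files last modified more than N minutes ago, so
    # a file in use by a monitor operation right now is skipped.
    # -print lists each matching file before -delete removes it.
    find "$dir" -maxdepth 1 -type f -name 'check_slave.*' \
        -mmin +"$min_age" -print -delete
}

# Example, after upgrading to the fixed package:
#   cleanup_stale_check_slave /var/run/resource-agents 10
```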

Comment 8 michal novacek 2016-09-02 14:15:21 UTC
I have verified that there are no more leftover files on the slave after the patch
in resource-agents-3.9.5-81.el7.x86_64.

---

common setup
 * set up the cluster (1)
 * set up the mysql resource as a master/slave clone and verify that mysql is
   running (2)

before the patch (resource-agents-3.9.5-54.el7.x86_64)
======================================================
> on slave virt-022, there are a lot of leftover files

[root@virt-022 ~]# find /var/run/resource-agents
/var/run/resource-agents
/var/run/resource-agents/check_slave.mysql.FtU5zA
/var/run/resource-agents/check_slave.mysql.vIqLEy
/var/run/resource-agents/check_slave.mysql.rGDFFZ
/var/run/resource-agents/check_slave.mysql.8X9GpU
/var/run/resource-agents/check_slave.mysql.Nns5S0
/var/run/resource-agents/check_slave.mysql.jRlGo9
/var/run/resource-agents/check_slave.mysql.pHXcvw
/var/run/resource-agents/check_slave.mysql.YM8ofW
/var/run/resource-agents/check_slave.mysql.ej6his
/var/run/resource-agents/check_slave.mysql.BFzDwJ
/var/run/resource-agents/check_slave.mysql.lvI3t5
/var/run/resource-agents/check_slave.mysql.iQr2Xt
/var/run/resource-agents/check_slave.mysql.4unyu5
/var/run/resource-agents/threads.mysql.XGarNE
/var/run/resource-agents/master_status.mysql
/var/run/resource-agents/cmirrord-clvmd.pid
/var/run/resource-agents/clvmd-clvmd.pid


after the patch (resource-agents-3.9.5-81.el7.x86_64)
=====================================================

> with the mysql master active on virt-005 and virt-022 as the slave:

[root@virt-022 ~]# find /var/run/resource-agents
/var/run/resource-agents
/var/run/resource-agents/cmirrord-clvmd.pid
/var/run/resource-agents/clvmd-clvmd.pid


-----

>> (1)
[root@virt-005 ~]# pcs status
Cluster name: STSRHTS2803
Stack: corosync
Current DC: virt-021 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep  2 15:40:09 2016          Last change: Fri Sep  2 15:34:26 2016 by root via cibadmin on virt-005

4 nodes and 14 resources configured

Online: [ virt-004 virt-005 virt-021 virt-022 ]

Full list of resources:

 fence-virt-004 (stonith:fence_xvm):    Started virt-004
 fence-virt-005 (stonith:fence_xvm):    Started virt-005
 fence-virt-021 (stonith:fence_xvm):    Started virt-021
 fence-virt-022 (stonith:fence_xvm):    Started virt-022
 Clone Set: dlm-clone [dlm]
     Started: [ virt-004 virt-005 virt-021 virt-022 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ virt-004 virt-005 virt-021 virt-022 ]
 Master/Slave Set: mysql-clone [mysql]
     Masters: [ virt-005 ]
     Slaves: [ virt-022 ]

Failed Actions:

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

>> (2)
[root@virt-005 ~]# pcs resource show --full
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
               monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
               monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
 Master: mysql-clone
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
  Resource: mysql (class=ocf provider=heartbeat type=mysql)
   Operations: start interval=0s timeout=240 (mysql-start-interval-0s)
               stop interval=0s timeout=120 (mysql-stop-interval-0s)
               promote interval=0s timeout=120 (mysql-promote-interval-0s)
               demote interval=0s timeout=120 (mysql-demote-interval-0s)
               monitor interval=30s (mysql-monitor-interval-30s)
               monitor interval=15s OCF_CHECK_LEVEL=10 (mysql-monitor-interval-15s)

Comment 10 errata-xmlrpc 2016-11-03 23:58:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html