Bug 1263444

Summary: Memory leak in pacemaker_remote's proxy dispatch function
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.1
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Reporter: Ken Gaillot <kgaillot>
Assignee: Andrew Beekhof <abeekhof>
QA Contact: cluster-qe <cluster-qe>
CC: cfeist, cluster-maint
Target Milestone: rc
Target Release: 7.2
Hardware: All
OS: All
Fixed In Version: 1.1.13-8
Doc Type: Bug Fix
Last Closed: 2015-12-03 23:48:23 UTC
Type: Bug

Description Ken Gaillot 2015-09-15 19:24:45 UTC
Description of problem: In normal cluster operation, remote nodes running pacemaker_remote will exhibit memory leaks.


Version-Release number of selected component (if applicable): 1.1.12-22.el7_1.4


How reproducible: Run a pacemaker cluster that includes a remote node


Steps to Reproduce:
1. Set up a cluster that includes a remote node.
2. Use valgrind to run pacemaker_remote:
2a. yum install valgrind
2b. Uncomment VALGRIND_OPTS in /etc/sysconfig/pacemaker_remote
2c. mkdir /etc/systemd/system/pacemaker_remote.service.d
2d. cat >/etc/systemd/system/pacemaker_remote.service.d/valgrind.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/valgrind /usr/sbin/pacemaker_remoted
EOF
2e. If the cluster is running, disable the remote node resource, then run "systemctl restart pacemaker_remote" and re-enable the remote node resource if needed
3. Start the cluster and perform routine cluster actions. I tried actions such as disabling and enabling a resource running on the remote node, migrating a resource to and from the remote node, setting and unsetting node attributes for the remote node, disabling and enabling the remote node resource itself (from another node in the cluster), and running various CLI commands (crm_attribute, attrd_updater, stonith_admin, crm_mon, etc.) from the remote node itself. You don't need to do all of them; a few are enough.
4. Disable the remote node resource, then "systemctl stop pacemaker_remote"
5. Examine the valgrind output on the remote node (in /var/lib/pacemaker/valgrind-* by default).
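To make step 5 quicker, the leak summaries can be pulled out of the logs with a short helper. This is only a sketch, assuming the default log location mentioned in step 5; the `leak_summary` name is made up here:

```shell
# leak_summary DIR: print valgrind's "lost" summary lines from each log in DIR.
# DIR defaults to /var/lib/pacemaker, the default location mentioned in step 5.
leak_summary() {
    dir="${1:-/var/lib/pacemaker}"
    for f in "$dir"/valgrind-*; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        printf '== %s ==\n' "$f"
        grep -E '(definitely|indirectly|possibly) lost' "$f"
    done
}
```

On a clean run the three counts are zero (or valgrind reports no leaks at all); with this bug, the counts are nonzero and grow.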

Actual results:
Valgrind output will show nonzero "definitely lost" + "indirectly lost" + "possibly lost" byte counts, and the backtraces will contain "crm_ipcs_recv".

Expected results: No memory lost.


Additional info: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html

Comment 1 Ken Gaillot 2015-09-15 19:27:20 UTC
Fixed upstream as of commit 1019d3e

Comment 2 Ken Gaillot 2015-09-15 19:37:30 UTC
The leak seems serious. It occurs every time the remote node needs to proxy a connection to the pacemaker components on its hosting cluster node. The number of bytes lost appears to increase with each occurrence, though I haven't investigated whether that is actually the case or an artifact of how valgrind reports it. If accurate, the loss quickly reaches tens of MB when commands are run continuously.
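To exercise the proxy path repeatedly and make the growth obvious in the valgrind summary, a loop like the following can be used on the remote node. This is a minimal sketch, assuming the CLI commands listed in the description are available; the `stress_proxy` name, attribute name, and iteration count are made up here:

```shell
# stress_proxy CMD [COUNT]: run CMD repeatedly (default 100 times) so that
# each proxied IPC round trip gets a chance to trigger the leak.
stress_proxy() {
    cmd="$1"
    count="${2:-100}"
    i=0
    while [ "$i" -lt "$count" ]; do
        $cmd
        i=$((i + 1))
    done
}

# Example (on the remote node): repeatedly set a throwaway node attribute
# stress_proxy 'attrd_updater -n test-attr -U 1' 500
```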

Comment 4 Ken Gaillot 2015-09-22 13:53:47 UTC
Fixed upstream as of commit 1019d3e.

Comment 5 Ken Gaillot 2015-12-03 23:48:23 UTC
The fix for this was included in the pacemaker packages released with 7.2.