| Summary: | Memory leak when objdb reloads config | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Milan Broz <mbroz> | ||||||||
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||||
| Severity: | urgent | Docs Contact: | |||||||||
| Priority: | urgent | ||||||||||
| Version: | 6.1 | CC: | agk, ccaulfie, cluster-maint, djansa, fdinitto, jfriesse, jkortus, lhh, pvrabec, rpeterso, sdake, teigland | ||||||||
| Target Milestone: | rc | Keywords: | ZStream | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | All | ||||||||||
| OS: | All | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | corosync-1.2.3-28.el6 | Doc Type: | Bug Fix | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 680155 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2011-05-19 14:24:26 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 680155, 681258 | ||||||||||
| Attachments: | reproducer (479959), proposed patch (480106), proposed patch to remove leak of handles (480464) | ||||||||||
Description
Milan Broz
2011-02-16 11:56:51 UTC
I'll take this one for now. It might be a cman problem.

(and this happens when ricci is not running - cluster misconfiguration)

(In reply to comment #3)
> (and this happens when ricci is not running - cluster misconfiguration)

The problem is slightly different, though. Your cluster had an inconsistent cluster.conf around: one node had version X and the others version Y. ricci not running, plus a series of manual overrides (otherwise it's simply not possible to get there), led to that inconsistent setup. The nodes with the older config will keep trying to load the new config in a loop in an attempt to recover from that situation.

This reload loop seems to be leaking memory, but we still need to investigate whether that is really the issue and which component is at fault (there are 3/4 involved in that process).

Created attachment 479959 [details]
reproducer
The patch in the attachment disables all cluster reload code (cman/cman-preconfig/ccs_xml plugins) and adds a heavy loop around the object_reload_config call.
First startup is still functional; this way we basically exclude any cluster-related code from the configuration reload path and isolate the issue to corosync.
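As an illustration only (not the actual reproducer patch), here is a skeleton of what such a heavy reload loop looks like; reload_objdb_config() is a stub standing in for corosync's object_reload_config call, and everything else is assumed for the sake of the sketch:

```c
/* Hypothetical stress loop in the spirit of the reproducer patch:
 * keep asking objdb to reload the configuration so that any per-reload
 * leak becomes visible within seconds. */
#include <stdio.h>
#include <stdbool.h>

static bool reload_objdb_config(void)
{
	/* In the real patch this would invoke object_reload_config();
	 * here it is stubbed out so the sketch stays self-contained. */
	return true;
}

int main(void)
{
	unsigned long iterations = 0;

	for (;;) {
		if (!reload_objdb_config()) {
			fprintf(stderr, "reload failed after %lu iterations\n",
			        iterations);
			return 1;
		}
		iterations++;
		if (iterations % 100000 == 0)
			printf("reloads so far: %lu\n", iterations);
	}
}
```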
How to reproduce quickly:
- 2 RHEL 6 nodes
- patch cluster (with the attached patch), build, install
- edit cluster.conf on node1 to be version="1"
- on node2 version="2"
- start monitoring corosync memory usage (see the RSS-polling sketch after these steps)
- cman_tool -D join on node1 (config version 1)
- wait a bit
- cman_tool -D join on node2 (config version 2)
corosync on node1 will loop almost immediately on objdb_reload_config and increase memory usage heavily within a few seconds (i.e. be ready for a killall -9 corosync)
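For the "start monitoring corosync memory usage" step, a small self-contained helper (not part of the original reproducer) that polls VmRSS from /proc/&lt;pid&gt;/status once per second is enough to watch the growth; the program name and invocation below are illustrative:

```c
/* Poll VmRSS for a given PID once per second, e.g.:
 *   ./rss_watch $(pidof corosync)
 * Relies only on /proc/<pid>/status, which is standard on Linux. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	char path[64];
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

	for (;;) {
		FILE *f = fopen(path, "r");
		if (f == NULL) {
			perror("fopen");	/* process probably exited */
			return 1;
		}

		char line[256];
		while (fgets(line, sizeof(line), f) != NULL) {
			if (strncmp(line, "VmRSS:", 6) == 0) {
				fputs(line, stdout);	/* e.g. "VmRSS:  12345 kB" */
				break;
			}
		}
		fclose(f);
		sleep(1);
	}
}
```

Start it on node1 before issuing the cman_tool joins and the per-reload growth shows up immediately.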
Created attachment 480106 [details]
Proposed patch
The main problem seems to be that the old code allocates X items in the list, but deletion frees only X-1 of them (everything except tmplist).
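To make the off-by-one concrete, here is a minimal, self-contained sketch of that leak pattern in plain C; the list type and function names are illustrative, not the actual corosync objdb code:

```c
#include <stdlib.h>

struct item {
	struct item *next;
	char payload[64];
};

/* Build a list of n items, as one config reload would. */
static struct item *build_list(int n)
{
	struct item *head = NULL;

	for (int i = 0; i < n; i++) {
		struct item *it = calloc(1, sizeof(*it));
		it->next = head;
		head = it;
	}
	return head;
}

/*
 * Buggy cleanup in the spirit of the report: it walks the list but
 * stops one node short, so the last element (the "tmplist" equivalent)
 * is never freed.  Repeated on every reload, this leaks steadily.
 */
static void free_list_buggy(struct item *head)
{
	while (head != NULL && head->next != NULL) {	/* off by one */
		struct item *next = head->next;
		free(head);
		head = next;
	}
	/* BUG: the final element is leaked here. */
}

/* Correct cleanup frees every node, including the last one. */
static void free_list_fixed(struct item *head)
{
	while (head != NULL) {
		struct item *next = head->next;
		free(head);
		head = next;
	}
}

int main(void)
{
	/* One "reload": allocate X items, then tear them down. */
	free_list_buggy(build_list(16));	/* leaks one item per call */
	free_list_fixed(build_list(16));	/* leaks nothing */
	return 0;
}
```

In the same spirit, the proposed patch makes the reload cleanup free every allocated entry, including the one the old code skipped.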
Also please note that the patch has a side effect: without this patch, the trigger totem_objdb_reload_notify was never called; with the patch, it is correctly called.

I just cross-checked this patch and it does indeed fix at least one of the memory leaks in the reload path. When used in conjunction with a good cman we still experience a memory leak, but I am in the process of identifying where we leak.

On Angus' suggestion, running the corosync memory leak test from cts. The code here: corosync from RHEL 6 plus Honzaf's patch above, with no cman or any cluster component loaded at all. Single-node test. Note that as long as we don't eliminate this leak, I cannot easily verify possible leaks in cman/cluster on this same code path.

    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    1168
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    144
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    104
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    104
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    104
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100

Created attachment 480464 [details]
Proposed patch to remove leak of handles
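For readers unfamiliar with the handle idiom, here is a hedged sketch of the kind of reference-count leak a "leak of handles" usually means: a lookup takes a reference that the caller never releases. The types and function names below are illustrative only, not corosync's actual hdb/objdb API:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative refcounted object, standing in for an objdb handle. */
struct handle_obj {
	int refcount;
	char data[32];
};

static struct handle_obj *handle_get(struct handle_obj *h)
{
	h->refcount++;		/* every lookup takes a reference */
	return h;
}

static void handle_put(struct handle_obj *h)
{
	if (--h->refcount == 0)
		free(h);	/* last reference releases the memory */
}

static void reload_config_leaky(struct handle_obj *h)
{
	struct handle_obj *ref = handle_get(h);
	/* ... walk / reload the object tree ... */
	(void)ref;
	/* BUG: missing handle_put(ref); the refcount never drops back,
	 * so the object outlives the reload and leaks on every pass. */
}

static void reload_config_fixed(struct handle_obj *h)
{
	struct handle_obj *ref = handle_get(h);
	/* ... walk / reload the object tree ... */
	handle_put(ref);	/* balance the get before returning */
}

int main(void)
{
	struct handle_obj *h = calloc(1, sizeof(*h));
	h->refcount = 1;	/* creator's reference */

	reload_config_fixed(h);	/* refcount returns to 1 */
	reload_config_leaky(h);	/* refcount stuck at 2: leaked reference */

	printf("refcount after reloads: %d\n", h->refcount);
	handle_put(h);		/* drop creator's ref; object still lingers */
	return 0;
}
```

Balancing every get with a put on the reload path, which is presumably what the attached patch does, lets the last reference drop so the memory can be reclaimed.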
With this second patch I can see that mem_leak_test goes down to 0 after 2 iterations. With cman we are still leaking 12 bytes per reload (down from 16). Note that the code path cman uses up to the failure involves only objdb calls, plus one call to xml; xml has already been tested separately and shows no leaks.

More news: we found the remaining issue in config_xml.lcrso. The corosync bits are all good as far as I can tell. Cloning the bz for cluster/cman.

Patches committed upstream as:
41aeecc4eff296252a1ffc06f8c581ec90b9076d
894ece6a141c2d24a332a7375696615e38ca5375

Verified with corosync-1.2.3-28.el6.x86_64: 2-node cluster, start up the cluster (no ricci), increase the config version on one node and restart it; the 2nd node begins the reload loop. Without the patch the memory footprint (RSS) increases by about 1M per minute. With the patch there was no increase for about 20 minutes of looping.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html