Bug 1305119 - corosync memory footprint increases on every node rejoin
Summary: corosync memory footprint increases on every node rejoin
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1309809
Depends On:
Blocks: 1306349
 
Reported: 2016-02-05 17:46 UTC by Jaroslav Kortus
Modified: 2019-10-10 11:07 UTC (History)
4 users (show)

Fixed In Version: corosync-1.4.7-5.el6
Doc Type: Bug Fix
Doc Text:
Cause: A node rejoins the cluster. Consequence: Some buffers in corosync are not freed, so memory consumption grows. Fix: Make sure all buffers are freed. Result: No memory is leaked.
Clone Of:
Clones: 1306349
Environment:
Last Closed: 2016-05-10 19:43:04 UTC
Target Upstream Version:
Embargoed:


Attachments
cluster logs (crm_report) (100.82 KB, application/x-bzip), 2016-02-08 13:31 UTC, Jaroslav Kortus
Proposed patch (3.55 KB, patch), 2016-02-10 15:01 UTC, Jan Friesse


Links
Red Hat Knowledge Base (Solution) 678653, last updated 2016-02-19 14:34:57 UTC
Red Hat Product Errata RHBA-2016:0753, SHIPPED_LIVE, corosync bug fix update, last updated 2016-05-10 22:32:07 UTC

Description Jaroslav Kortus 2016-02-05 17:46:12 UTC
Description of problem:
In a 2-node cluster, run "service cman restart" on one node and watch corosync memory consumption (RSS) on the other.

Eventually, the OOM killer is invoked. It happened quite early on our virtual machines, as they do not have much RAM.

Version-Release number of selected component (if applicable):
corosync-1.4.7-4.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. start 2-node cluster
2. run service cman restart in a loop on one node
3. watch the RSS size of corosync on the other node (cat /proc/`pgrep corosync`/status | grep RSS); a standalone sketch of this check is shown after these steps
4. in my setup the number jumps up by about 2 MB on every cycle
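
For convenience, a minimal standalone sketch of the check in step 3 (illustrative only, not part of the original report; the program name rss_watch is made up). It simply re-reads VmRSS from /proc/<pid>/status in a loop so that growth per rejoin cycle is easy to spot:

/* rss_watch.c: print corosync's VmRSS (resident set size, in kB) once per
 * interval, equivalent to re-running
 *   grep VmRSS /proc/$(pgrep corosync)/status
 * in a loop. Illustrative sketch only. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64], line[256];

    if (argc != 2) {
        fprintf(stderr, "usage: %s <corosync-pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

    for (;;) {
        FILE *f = fopen(path, "r");
        if (f == NULL) {
            perror(path);              /* process exited (or was OOM-killed) */
            return 1;
        }
        /* /proc/<pid>/status is a set of "Key:\tvalue" lines; VmRSS is the
         * resident set size in kB. */
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "VmRSS:", 6) == 0) {
                fputs(line, stdout);   /* e.g. "VmRSS:     59844 kB" */
                fflush(stdout);
                break;
            }
        }
        fclose(f);
        sleep(5);                      /* sample every 5 seconds */
    }
}

Build with "cc -o rss_watch rss_watch.c" and run as "./rss_watch $(pgrep corosync)" while restarting cman in a loop on the other node.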

Actual results:
Memory footprint keeps increasing; the OOM killer is eventually invoked:
Feb  5 17:04:25 virt-002 kernel: Out of memory: Kill process 5699 (corosync) score 412 or sacrifice child
Feb  5 17:04:25 virt-002 kernel: Killed process 5699, UID 0, (corosync) total-vm:1088500kB, anon-rss:776288kB, file-rss:44144kB


Expected results:
The memory footprint should stay roughly the same across many rejoin cycles.

Additional info:

Comment 2 Jan Friesse 2016-02-08 07:00:20 UTC
Jaroslav,
can you please attach the config file and corosync.log from both nodes? Is this behavior new in 1.4.7-4, or was it also present in 1.4.7-2? Can you try corosync without cman, so we can reduce the scope to corosync only (and not cman)?

Comment 3 Jaroslav Kortus 2016-02-08 13:31:34 UTC
Created attachment 1122175 [details]
cluster logs (crm_report)

Comment 4 Jaroslav Kortus 2016-02-08 13:34:10 UTC
I've attached the crm_report, which I hope has all the useful info in one package.
The same behaviour can be observed using corosync-1.4.7-2.el6.x86_64 (RHEL 6.7).

Diff after 10 iterations (1.4.7-2):
VmRSS:     59844 kB
VmRSS:     80404 kB

Comment 5 Jan Friesse 2016-02-08 15:37:37 UTC
OK, so it looks like a pretty minimal two-node cluster.

Can you please try corosync without cman so we can reduce scope to corosync only (and not cman)?

Comment 6 Jan Friesse 2016-02-09 14:10:52 UTC
Also pcsd was hitting following glibc bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1102739

So maybe it's the same problem.

Comment 7 Jan Friesse 2016-02-10 15:01:40 UTC
Created attachment 1122820 [details]
Proposed patch

totempg: Fix memory leak

Previously there were two free lists, one for the operational state and
one for the transitional state. Because every node starts in the
transitional state and always ends in the operational state, an assembly
was always put on the operational-state free list and never on the
transitional free list, so a new assembly structure was always allocated
whenever a new node connected.

The solution is to have only one free list.
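
For readers without the attachment at hand, the pattern described above can be sketched roughly as follows (illustrative only, not the actual totempg code from the proposed patch; the struct layout and function names here are invented):

/* Sketch of the single-free-list fix: released assemblies go back onto one
 * shared list, so the next join reuses them instead of allocating anew. */
#include <stdlib.h>

struct assembly {
    struct assembly *next;        /* free-list link (hypothetical layout) */
    /* ... per-node reassembly buffers would live here ... */
};

/* One shared free list, regardless of operational/transitional state. */
static struct assembly *free_list;

static struct assembly *assembly_acquire(void)
{
    if (free_list != NULL) {
        struct assembly *a = free_list;             /* reuse a recycled structure */
        free_list = a->next;
        return a;
    }
    return calloc(1, sizeof(struct assembly));      /* allocate only when the list is empty */
}

static void assembly_release(struct assembly *a)
{
    /* Always return to the one shared list, whichever state the node was in
     * when it left. */
    a->next = free_list;
    free_list = a;
}

int main(void)
{
    /* Two join/leave cycles: the second join reuses the structure released
     * by the first, so only one allocation ever happens. */
    struct assembly *a = assembly_acquire();
    assembly_release(a);
    struct assembly *b = assembly_acquire();        /* recycled, b == a */
    assembly_release(b);
    return 0;
}

With a single shared list, an assembly released in either state is available the next time any node joins, so repeated rejoins no longer allocate fresh structures and RSS stays flat.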

Comment 10 Jan Friesse 2016-02-19 08:01:19 UTC
*** Bug 1309809 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2016-05-10 19:43:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0753.html

