| Summary: | Memory leak when objdb reloads config | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Milan Broz <mbroz> | ||||||||
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||||
| Severity: | urgent | Docs Contact: | |||||||||
| Priority: | urgent | ||||||||||
| Version: | 6.1 | CC: | agk, ccaulfie, cluster-maint, djansa, fdinitto, jfriesse, jkortus, lhh, pvrabec, rpeterso, sdake, teigland | ||||||||
| Target Milestone: | rc | Keywords: | ZStream | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | All | ||||||||||
| OS: | All | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | corosync-1.2.3-28.el6 | Doc Type: | Bug Fix | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 680155 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2011-05-19 14:24:26 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 680155, 681258 | ||||||||||
| Attachments: | reproducer (479959), proposed patch (480106), proposed patch to remove leak of handles (480464) | ||||||||||
Description
Milan Broz
2011-02-16 11:56:51 UTC
I'll take this one for now. It might be a cman problem.

(and this happens when ricci is not running - cluster misconfiguration)

(In reply to comment #3)
> (and this happens when ricci is not running - cluster misconfiguration)

The problem is slightly different, though. Your cluster had an inconsistent cluster.conf around: one node had version X and the others version Y. ricci not running, plus a series of manual overrides (otherwise it's simply not possible to get there), led to that inconsistent setup. The nodes with the older config will keep trying to load the new config in a loop in an attempt to recover from that situation.

This reload loop seems to be leaking memory, but we still need to investigate whether that is really the issue and which component is at fault (there are 3/4 involved in that process).

Created attachment 479959 [details]
reproducer
The patch in the attachment disables all cluster reload code (cman/cman-preconfig/ccs_xml plugins) and adds a heavy loop around the object_reload_config call.
First startup is still functional; this way we basically exclude any cluster-related code from the configuration reload path and isolate the issue to corosync.
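As an illustration only (not the actual reproducer patch), here is a skeleton of what such a heavy reload loop looks like; reload_objdb_config() is a stub standing in for corosync's object_reload_config call, and everything else is assumed for the sake of the sketch:

```c
/* Hypothetical stress loop in the spirit of the reproducer patch:
 * keep asking objdb to reload the configuration so that any per-reload
 * leak becomes visible within seconds. */
#include <stdio.h>
#include <stdbool.h>

static bool reload_objdb_config(void)
{
	/* In the real patch this would invoke object_reload_config();
	 * here it is stubbed out so the sketch stays self-contained. */
	return true;
}

int main(void)
{
	unsigned long iterations = 0;

	for (;;) {
		if (!reload_objdb_config()) {
			fprintf(stderr, "reload failed after %lu iterations\n",
			        iterations);
			return 1;
		}
		iterations++;
		if (iterations % 100000 == 0)
			printf("reloads so far: %lu\n", iterations);
	}
}
```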
How to reproduce quickly:
- 2 RHEL 6 nodes
- patch cluster (with the attached patch), build, install
- edit cluster.conf on node1 to be version="1"
- on node2 version="2"
- start monitoring corosync memory usage (see the RSS-polling sketch after these steps)
- cman_tool -D join on node1 (config version 1)
- wait a bit
- cman_tool -D join on node2 (config version 2)
corosync on node1 will loop almost immediately on objdb_reload_config and increase memory usage heavily within a few seconds (i.e. be ready for a killall -9 corosync)
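For the "start monitoring corosync memory usage" step, a small self-contained helper (not part of the original reproducer) that polls VmRSS from /proc/&lt;pid&gt;/status once per second is enough to watch the growth; the program name and invocation below are illustrative:

```c
/* Poll VmRSS for a given PID once per second, e.g.:
 *   ./rss_watch $(pidof corosync)
 * Relies only on /proc/<pid>/status, which is standard on Linux. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	char path[64];
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

	for (;;) {
		FILE *f = fopen(path, "r");
		if (f == NULL) {
			perror("fopen");	/* process probably exited */
			return 1;
		}

		char line[256];
		while (fgets(line, sizeof(line), f) != NULL) {
			if (strncmp(line, "VmRSS:", 6) == 0) {
				fputs(line, stdout);	/* e.g. "VmRSS:  12345 kB" */
				break;
			}
		}
		fclose(f);
		sleep(1);
	}
}
```

Start it on node1 before issuing the cman_tool joins and the per-reload growth shows up immediately.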
Created attachment 480106 [details]
Proposed patch
The main problem seems to be that the old code allocates X items in the list, but deletion frees only X-1 of them (everything except tmplist).
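To make the off-by-one concrete, here is a minimal, self-contained sketch of that leak pattern in plain C; the list type and function names are illustrative, not the actual corosync objdb code:

```c
#include <stdlib.h>

struct item {
	struct item *next;
	char payload[64];
};

/* Build a list of n items, as one config reload would. */
static struct item *build_list(int n)
{
	struct item *head = NULL;

	for (int i = 0; i < n; i++) {
		struct item *it = calloc(1, sizeof(*it));
		it->next = head;
		head = it;
	}
	return head;
}

/*
 * Buggy cleanup in the spirit of the report: it walks the list but
 * stops one node short, so the last element (the "tmplist" equivalent)
 * is never freed.  Repeated on every reload, this leaks steadily.
 */
static void free_list_buggy(struct item *head)
{
	while (head != NULL && head->next != NULL) {	/* off by one */
		struct item *next = head->next;
		free(head);
		head = next;
	}
	/* BUG: the final element is leaked here. */
}

/* Correct cleanup frees every node, including the last one. */
static void free_list_fixed(struct item *head)
{
	while (head != NULL) {
		struct item *next = head->next;
		free(head);
		head = next;
	}
}

int main(void)
{
	/* One "reload": allocate X items, then tear them down. */
	free_list_buggy(build_list(16));	/* leaks one item per call */
	free_list_fixed(build_list(16));	/* leaks nothing */
	return 0;
}
```

In the same spirit, the proposed patch makes the reload cleanup free every allocated entry, including the one the old code skipped.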
Also please note that the patch has a side effect: without this patch, the trigger totem_objdb_reload_notify was never called; with the patch, it is correctly called.

I just cross-checked this patch and it does indeed fix at least one of the memory leaks in the reload path. When used in conjunction with a good cman we still experience a memory leak, but I am in the process of identifying where we leak.

On Angus' suggestion, running the corosync memory leak test from cts. The code here: corosync from RHEL 6 plus Honzaf's patch above, with no cman or any cluster component loaded at all. Single-node test. Note that as long as we don't eliminate this leak, I cannot easily verify possible leaks in cman/cluster on this same code path.

    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    1168
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    144
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    104
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    104
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    104
    [root@rhel6-node2 ~]# sh mem_leak_test.sh -1
    100

Created attachment 480464 [details]
Proposed patch to remove leak of handles
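For readers unfamiliar with the handle idiom, here is a hedged sketch of the kind of reference-count leak a "leak of handles" usually means: a lookup takes a reference that the caller never releases. The types and function names below are illustrative only, not corosync's actual hdb/objdb API:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative refcounted object, standing in for an objdb handle. */
struct handle_obj {
	int refcount;
	char data[32];
};

static struct handle_obj *handle_get(struct handle_obj *h)
{
	h->refcount++;		/* every lookup takes a reference */
	return h;
}

static void handle_put(struct handle_obj *h)
{
	if (--h->refcount == 0)
		free(h);	/* last reference releases the memory */
}

static void reload_config_leaky(struct handle_obj *h)
{
	struct handle_obj *ref = handle_get(h);
	/* ... walk / reload the object tree ... */
	(void)ref;
	/* BUG: missing handle_put(ref); the refcount never drops back,
	 * so the object outlives the reload and leaks on every pass. */
}

static void reload_config_fixed(struct handle_obj *h)
{
	struct handle_obj *ref = handle_get(h);
	/* ... walk / reload the object tree ... */
	handle_put(ref);	/* balance the get before returning */
}

int main(void)
{
	struct handle_obj *h = calloc(1, sizeof(*h));
	h->refcount = 1;	/* creator's reference */

	reload_config_fixed(h);	/* refcount returns to 1 */
	reload_config_leaky(h);	/* refcount stuck at 2: leaked reference */

	printf("refcount after reloads: %d\n", h->refcount);
	handle_put(h);		/* drop creator's ref; object still lingers */
	return 0;
}
```

Balancing every get with a put on the reload path, which is presumably what the attached patch does, lets the last reference drop so the memory can be reclaimed.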
With this second patch I can see that mem_leak_test goes down to 0 after 2 iterations. With cman we are still leaking 12 bytes per reload (down from 16). Note that the code path cman uses up to the failure involves only objdb calls, plus one call to xml; xml has already been tested separately and shows no leaks.

More news: we found the remaining issue in config_xml.lcrso. The corosync bits are all good as far as I can tell. Cloning the bz for cluster/cman.

Patches committed upstream as:
41aeecc4eff296252a1ffc06f8c581ec90b9076d
894ece6a141c2d24a332a7375696615e38ca5375

Verified with corosync-1.2.3-28.el6.x86_64: 2-node cluster, start up the cluster (no ricci), increase the config version on one node and restart it; the 2nd node begins the reload loop. Without the patch the memory footprint (RSS) increases by about 1M per minute. With the patch there was no increase for about 20 minutes of looping.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html