Bug 742431

Summary: modclusterd memory footprint is growing over time
Product: Red Hat Enterprise Linux 6
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: clustermon
Assignee: Jan Pokorný [poki] <jpokorny>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 6.1
CC: bbrock, c.handel, cluster-maint, edamato, james.brown, jpokorny, jwest, kabbott, rmunilla, rsteiger, sbradley, tao, uwe.knop
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: modcluster-0.16.2-16.el6
Doc Type: Bug Fix
Doc Text:
Cause
* trigger unknown, presumably uncommon event/attribute of the environment
Consequence
* outgoing queues in inter-nodes communication are growing over time
Fix
* better balanced inter-nodes communication + restriction of the queues
Result
* resources utilization kept at reasonable level
* possible queues interventions logged in /var/log/clumond.log
Clone Of: 618321
Last Closed: 2012-06-20 11:57:14 UTC
Bug Depends On: 618321    
Bug Blocks: 756082    
Attachments (no flags set):
[PATCH 1/6] fix bz742431: clarify recv/read_restart+send/write_restart
[PATCH 2/6] fix bz742431: introduce per-peer outgoing queue pruning
[PATCH 3/6] fix bz742431: limit peer's send() to one message only
[PATCH 4/6] fix bz742431: read all available with peer's receive()
[PATCH 5/6] fix bz742431: split+restructure poll handling in communicator
[PATCH 6/6] fix bz742431: turn off Nagle's alg. in peers' communication
bz742431: additional performance improvement patch [1/2]
bz742431: additional performance improvement patch [2/2]
bz742431: additional fix for a minor memory leak

Comment 1 Fabio Massimo Di Nitto 2011-09-30 06:00:25 UTC
According to:

https://www.redhat.com/archives/linux-cluster/2011-September/msg00067.html

this issue exists in RHEL6 too.

Comment 5 Jan Pokorný [poki] 2011-11-24 15:37:19 UTC
Created attachment 535954
[PATCH 1/6] fix bz742431: clarify recv/read_restart+send/write_restart

Comment 6 Jan Pokorný [poki] 2011-11-24 15:39:45 UTC
Created attachment 535955
[PATCH 2/6] fix bz742431: introduce per-peer outgoing queue pruning

Comment 7 Jan Pokorný [poki] 2011-11-24 15:40:45 UTC
Created attachment 535956
[PATCH 3/6] fix bz742431: limit peer's send() to one message only

Comment 8 Jan Pokorný [poki] 2011-11-24 15:42:05 UTC
Created attachment 535957
[PATCH 4/6] fix bz742431: read all available with peer's receive()
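
The patch itself is not reproduced in this report. As a rough illustration of what "read all available" can mean for a non-blocking peer socket, here is a minimal Python sketch; the function name `drain` and the chunk size are hypothetical and are not taken from the modcluster sources.

```python
import socket


def drain(sock, chunk_size=4096):
    """Read everything currently buffered on a non-blocking socket.

    Hypothetical helper for illustration only; not the modclusterd code.
    """
    chunks = []
    while True:
        try:
            data = sock.recv(chunk_size)
        except BlockingIOError:
            break  # nothing more to read right now
        if not data:
            break  # peer closed the connection
        chunks.append(data)
    return b"".join(chunks)
```

Draining the socket on every readable event keeps the kernel receive buffer from filling up, which in turn lets the sending peer empty its own outgoing queue.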

Comment 9 Jan Pokorný [poki] 2011-11-24 15:43:10 UTC
Created attachment 535959
[PATCH 5/6] fix bz742431: split+restructure poll handling in communicator

Comment 10 Jan Pokorný [poki] 2011-11-24 15:44:15 UTC
Created attachment 535961
[PATCH 6/6] fix bz742431: turn off Nagle's alg. in peers' communication
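
For readers unfamiliar with the technique: Nagle's algorithm is normally disabled per socket with the standard `TCP_NODELAY` option. The sketch below shows the general idea in Python; the helper name `disable_nagle` is hypothetical, and the actual patch is not shown in this report.

```python
import socket


def disable_nagle(sock):
    """Turn off Nagle's algorithm on a TCP socket.

    Returns True if the option reads back as enabled afterwards.
    Hypothetical helper for illustration only.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

With `TCP_NODELAY` set, small messages are sent immediately instead of being held back for coalescing, which trades some bandwidth efficiency for lower latency.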

Comment 11 Jan Pokorný [poki] 2011-11-24 15:46:20 UTC
Created attachment 535963
bz742431: additional performance improvement patch [1/2]

Comment 12 Jan Pokorný [poki] 2011-11-24 15:47:10 UTC
Created attachment 535964
bz742431: additional performance improvement patch [2/2]

Comment 14 Radek Steiger 2011-11-28 18:17:07 UTC
As per bug 618321 comment 75 (https://bugzilla.redhat.com/show_bug.cgi?id=618321#c75), acking this for QA using the artificial test described there.

Comment 15 Jan Pokorný [poki] 2011-12-06 21:06:05 UTC
Created attachment 541598
bz742431: additional fix for a minor memory leak

This is a revisited version of the original patch, attachment 529083
(accidentally posted to bug 618321 when it should have been posted here).

Recap: the leak is triggered by connections to /var/run/clumond.sock
       (2 B per connection IIRC, negligible compared with the big memory issue)

Comment 18 Jan Pokorný [poki] 2012-04-27 13:58:05 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
* trigger unknown, presumably uncommon event/attribute of the environment
Consequence
* outgoing queues in inter-nodes communication are growing over time
Fix
* better balanced inter-nodes communication + restriction of the queues
Result
* resources utilization kept at reasonable level
* possible queues interventions logged in /var/log/clumond.log
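
To make the "restriction of the queues" idea concrete, here is a minimal Python sketch of a bounded per-peer outgoing queue that prunes the oldest messages and logs each intervention. It is an illustration under stated assumptions, not the modclusterd implementation: the class name `BoundedOutQueue`, the limit of 100 messages, the drop-oldest policy, and the logger name are all hypothetical.

```python
import logging
from collections import deque

log = logging.getLogger("peer_queue")  # hypothetical logger name


class BoundedOutQueue:
    """Outgoing message queue that never grows beyond max_messages."""

    def __init__(self, max_messages=100):
        self.max_messages = max_messages
        self._queue = deque()
        self.dropped = 0  # total number of pruned messages

    def push(self, message):
        """Append a message, pruning the oldest ones if over the limit."""
        self._queue.append(message)
        pruned = 0
        while len(self._queue) > self.max_messages:
            self._queue.popleft()
            pruned += 1
        if pruned:
            self.dropped += pruned
            log.warning("pruned %d message(s) from outgoing queue", pruned)
        return pruned

    def pop_one(self):
        """Return the next message to send, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

Capping the queue bounds memory use even when a peer is slow or unreachable, at the cost of discarding stale messages; logging each pruning keeps that intervention visible to the administrator.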

Comment 25 errata-xmlrpc 2012-06-20 11:57:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0750.html