Bug 618321

Summary: modclusterd memory footprint is growing over time
Product: Red Hat Enterprise Linux 5
Reporter: Shane Bradley <sbradley>
Component: clustermon
Assignee: Jan Pokorný [poki] <jpokorny>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 5.5
CC: bbrock, c.handel, cluster-maint, djansa, edamato, james.brown, jwest, kabbott, rdassen, rmunilla, rsteiger, tao, uwe.knop, vleduc
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: modcluster-0.12.1-7.el5
Doc Type: Bug Fix
Doc Text:
Cause
* trigger unknown, presumably an uncommon event/attribute of the environment
Consequence
* outgoing queues in inter-node communication grow over time
Fix
* better balanced inter-node communication and restriction of the queues
Result
* resource utilization kept at a reasonable level
* possible queue interventions logged in /var/log/clumond.log
Clones: 742431
Last Closed: 2012-02-21 06:49:32 UTC
Bug Blocks: 742431, 758797
Attachments:
* sosreport
* [obsoleted by attachment 541598] minor memory leak fix (RHEL 6 modclusterd only)
* [PATCH 1/6] fix bz618321: clarify recv/read_restart+send/write_restart
* [PATCH 2/6] fix bz618321: introduce per-peer outgoing queue pruning
* [PATCH 3/6] fix bz618321: limit peer's send() to one message only
* [PATCH 4/6] fix bz618321: read all available with peer's receive()
* [PATCH 5/6] fix bz618321: split+restructure poll handling in communicator
* [PATCH 6/6] fix bz618321: turn off Nagle's alg. in peers' communication

Description Shane Bradley 2010-07-26 16:42:02 UTC
Description of problem:
On certain machines the customer is seeing the modclusterd process
consume and hold on to large portions of memory. This appears to be a
memory leak of some sort.

The customer has seen this on a couple of machines and uses modclusterd
for SNMP monitoring.

Here is example output of what they are seeing:

$ cat uptime
15:25:46 up 69 days,  9:53,  3 users,  load average: 1.05, 0.25, 0.08

$ grep modclusterd ps
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      7422  0.4 41.3 4394244 720792 ?      S<sl May11 487:50 modclusterd

The only way to release this memory is to hard-kill (kill -9) the
modclusterd process. This should be done before using the init scripts,
because the init scripts fail to shut down the process.

Version-Release number of selected component (if applicable):
modcluster-0.12.1-2.el5-x86_64  

How reproducible:
Not easily. I believe that SNMP might be involved in this.

Steps to Reproduce:
1. Enable modclusterd
2. Monitor memory footprint of modclusterd
  
Actual results:
The amount of memory that modclusterd holds grows very large over time.

Expected results:
The amount of memory that modclusterd holds should be small and stable.

Additional info:

I reviewed some graphs for this modclusterd process, and they show a
linear consumption of memory.

Comment 4 Ryan McCabe 2010-10-26 20:19:23 UTC
*** Bug 607602 has been marked as a duplicate of this bug. ***

Comment 9 Jeremy West 2010-11-08 17:58:14 UTC
Created attachment 458835 [details]
sosreport

Comment 31 Karl Abbott 2011-02-23 16:17:28 UTC
Anything of interest from the tests that were left running over the weekend?

Cheers,
Karl Abbott, RHCE
Technical Account Manager

Comment 32 Perry Myers 2011-02-23 16:24:30 UTC
No.  The engineering team was not able to reproduce any memory leakage.  I've asked eng to work with SEG to see if they can reproduce this, but as of now we can't find where the problem might be.

Comment 37 Lon Hohberger 2011-04-11 16:05:55 UTC
One possible problem is a scheduling issue.  The modclusterd program runs in the default "Other" (SCHED_OTHER) scheduling class; if there is a scheduling issue, this may well improve or go away after moving the process into one of the other scheduling classes.

A program I wrote a long time ago can be used to switch the priority queue of modclusterd.  It is located here:

   http://people.redhat.com/lhh/prio.tar.gz

Compile it (cd prio; make), and you can then change modclusterd's scheduling queue to FIFO or RR by running:

   ./prio set `pidof modclusterd` rr 1

or (for SCHED_FIFO):

   ./prio set `pidof modclusterd` fifo 1

To check the priority of modclusterd:

   ./prio `pidof modclusterd`

To set it back to normal, run:

   ./prio set `pidof modclusterd` other 0
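
(For reference: a minimal sketch of what a tool like prio presumably does
under the hood, via the POSIX sched_setscheduler(2) call. The prio sources
are not reproduced in this report, so treat the equivalence as an
assumption.)

    // prio_sketch.cpp -- hypothetical equivalent of "./prio set <pid> rr <prio>";
    // assumes prio wraps sched_setscheduler(2), which is not confirmed here.
    #include <sched.h>
    #include <sys/types.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            std::fprintf(stderr, "usage: %s <pid> <rt-priority>\n", argv[0]);
            return 1;
        }
        pid_t pid = static_cast<pid_t>(std::atoi(argv[1]));
        sched_param param;
        param.sched_priority = std::atoi(argv[2]);  // 1..99 for SCHED_RR
        // move the target process into the round-robin real-time class
        if (sched_setscheduler(pid, SCHED_RR, &param) != 0) {
            std::perror("sched_setscheduler");
            return 1;
        }
        return 0;
    }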

Comment 63 Jan Pokorný [poki] 2011-11-18 19:23:23 UTC
Created attachment 534468 [details]
[PATCH 2/6] fix bz618321: introduce per-peer outgoing queue pruning

There is a new "_prune_peer_queues" attribute serving as a flag that marks
the per-peer outgoing queues as eligible for pruning [*].  It is _set_ in
the "update" method called by the Monitor every ca. 5 seconds (its
intentional iteration period).  The flag is _cleared_ on every iteration of
appending a particular XML status update ("message") from the global queue
to the peer-local queues, i.e., the first such iteration may prune the
queues if the flag was previously set.  This (at least partially) ensures
the queues do not accumulate indefinitely under jarring conditions.

As the bool value is switched "atomically" from our point of view and,
in addition, we do not require absolute synchronization, accesses to
the flag are not guarded by a mutex (pro: no blocking), leading to
"react immediately (in the mentioned loops)" behavior.

Additionally, "update_peers" and "send" are merged into a single method,
as they are always used together (this also avoids splitting the mutex
usage).

[*] by "pruning" I mean "keep a possibly half-processed XML status update
    in, but drop any subsequent ones"
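
A minimal sketch of the pattern described above (class and member names are
illustrative, not the actual modcluster sources; the original code predates
C++11, so the real flag is presumably a plain bool rather than std::atomic):

    #include <atomic>
    #include <deque>
    #include <string>
    #include <vector>

    class Communicator {
    public:
        // called by the Monitor roughly every 5 seconds
        void update(const std::string &status_xml) {
            _global_queue.push_back(status_xml);
            _prune_peer_queues = true;       // queues now eligible for pruning
        }

        // called from the communicator loop when distributing updates to peers
        void distribute_to_peers() {
            bool prune = _prune_peer_queues;
            _prune_peer_queues = false;      // cleared on every iteration
            for (std::deque<std::string> &q : _peer_queues) {
                if (prune && q.size() > 1)
                    // keep a possibly half-processed update, drop the rest
                    q.erase(q.begin() + 1, q.end());
                if (!_global_queue.empty())
                    q.push_back(_global_queue.front());
            }
            if (!_global_queue.empty())
                _global_queue.pop_front();
        }

    private:
        std::atomic<bool> _prune_peer_queues{false};  // deliberately not mutex-guarded
        std::deque<std::string> _global_queue;
        std::vector<std::deque<std::string>> _peer_queues;
    };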

Comment 64 Jan Pokorný [poki] 2011-11-18 19:25:00 UTC
Created attachment 534469 [details]
[PATCH 3/6] fix bz618321: limit peer's send() to one message only

The commented-out assert is meant as a note that this condition should always
hold (in contrast to the previous explicit check, which was a no-op anyway).
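
The idea, roughly (an illustrative sketch, not the actual Peer class from
the modcluster sources):

    #include <sys/socket.h>
    #include <deque>
    #include <string>

    class Peer {
    public:
        // Send at most one message per POLLOUT wakeup so that a single peer
        // cannot monopolize the communicator loop.
        void send_one() {
            if (_out_queue.empty())
                return;
            const std::string &msg = _out_queue.front();
            ssize_t n = ::send(_fd, msg.data() + _sent, msg.size() - _sent, 0);
            if (n <= 0)
                return;  // EAGAIN or error: retry on the next POLLOUT
            _sent += static_cast<size_t>(n);
            if (_sent == msg.size()) {  // message fully written
                _out_queue.pop_front();
                _sent = 0;
            }
        }
    private:
        int _fd = -1;
        size_t _sent = 0;  // bytes of the front message already sent
        std::deque<std::string> _out_queue;
    };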

Comment 65 Jan Pokorný [poki] 2011-11-18 19:27:03 UTC
Created attachment 534470 [details]
[PATCH 4/6] fix bz618321: read all available with peer's receive()

The preprocessor conditional is kept so the behavior can easily be switched back when needed.
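
Roughly the following pattern (an illustrative sketch assuming a nonblocking
socket, not the actual receive() from the modcluster sources):

    #include <sys/socket.h>
    #include <cerrno>
    #include <string>

    // Keep reading until the kernel buffer is drained (EAGAIN/EWOULDBLOCK)
    // instead of one chunk per POLLIN event, so input cannot pile up in the
    // kernel while the communicator is busy elsewhere.
    std::string receive_all(int fd) {
        std::string data;
        char buf[4096];
        for (;;) {
            ssize_t n = ::recv(fd, buf, sizeof(buf), 0);
            if (n > 0) {
                data.append(buf, static_cast<size_t>(n));
                continue;
            }
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                break;  // drained: nothing more to read right now
            break;      // n == 0 (peer closed) or a real error
        }
        return data;
    }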

Comment 66 Jan Pokorný [poki] 2011-11-18 19:29:55 UTC
Created attachment 534471 [details]
[PATCH 5/6] fix bz618321: split+restructure poll handling in communicator

The old structure:
    1. server socket
        - POLLIN
        - POLLERR | POLLHUP | POLLNVAL
    2. client sockets
        ** POLLIN
        or POLLERR | POLLHUP | POLLNVAL
        or POLLOUT

The new structure:
    1. server socket -> handle_server_socket()
        - POLLIN (accept)
        - POLLERR | POLLHUP | POLLNVAL
    2. client sockets -> handle_client_socket()
        - POLLERR | POLLNVAL
        - POLLIN
        - POLLOUT
        - POLLHUP

It is now worthwhile to change "poll_data[i].events = POLLOUT"
to "poll_data[i].events |= POLLOUT", as these events are no longer
mutually exclusive (one pass can cover them both).

Also, for client sockets' POLLIN, add an optimization:
_delivery_point.msg_arrived() is called only for the last received message
(if any), not for all of them (see the in-code comment).
The preprocessor conditional is kept so the behavior can easily be switched back when needed.

Additionally, use "const" for method arguments when desirable and handle
previously suppressed exceptions.
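
An illustrative skeleton of the new dispatch (not the actual communicator
code; the handler bodies are reduced to comments):

    #include <poll.h>
    #include <sys/socket.h>
    #include <vector>

    static void handle_server_socket(pollfd &p) {
        if (p.revents & POLLIN) {
            // accept() the new peer connection and register it for polling
            int client = ::accept(p.fd, nullptr, nullptr);
            (void)client;
        }
        // POLLERR | POLLHUP | POLLNVAL: the listening socket is unusable
    }

    static void handle_client_socket(pollfd &p) {
        if (p.revents & (POLLERR | POLLNVAL)) return;  // drop the peer
        if (p.revents & POLLIN)  { /* peer.receive(): read all available */ }
        if (p.revents & POLLOUT) { /* peer.send(): at most one message   */ }
        if (p.revents & POLLHUP) { /* orderly hangup: drop the peer      */ }
    }

    static void communicator_loop(int server_fd, std::vector<pollfd> &poll_data) {
        // one poll() pass can now service POLLIN and POLLOUT on the same fd,
        // hence "events = POLLOUT" becomes "events |= POLLOUT"
        while (::poll(poll_data.data(),
                      static_cast<nfds_t>(poll_data.size()), 500) >= 0) {
            for (pollfd &p : poll_data) {
                if (p.revents == 0)
                    continue;
                if (p.fd == server_fd)
                    handle_server_socket(p);
                else
                    handle_client_socket(p);
            }
        }
    }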

Comment 67 Jan Pokorný [poki] 2011-11-18 19:31:40 UTC
Created attachment 534472 [details]
[PATCH 6/6] fix bz618321: turn off Nagle's alg. in peers' communication

The reason behind this is that we send whole messages (cluster XML
updates) and want them transported to the other peer immediately,
as the peer conversely wants to read the whole message.

Also expose respective Socket's methods as in "nonblocking" case.

Also remove duplicate "nonblocking" setting (will be set in Peer's
constructor anyway).
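
Turning Nagle's algorithm off comes down to a single socket option; a
minimal sketch (the actual Socket method in modcluster is not reproduced
here):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    // With TCP_NODELAY set, small writes go on the wire immediately instead
    // of being coalesced, which suits the "write one whole XML update, peer
    // reads one whole update" pattern.
    bool set_nodelay(int fd) {
        int on = 1;
        return ::setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == 0;
    }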

Comment 68 Jan Pokorný [poki] 2011-11-18 19:52:44 UTC
Created attachment 534482 [details]
[PATCH 1/6] fix bz618321: clarify recv/read_restart+send/write_restart

In fact, {read,write}_restart will never return -EAGAIN/-EWOULDBLOCK
(and never did before).
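
For illustration, the usual shape of such a *_restart helper (a generic
sketch, not the actual modcluster code): it restarts the call only when
interrupted by a signal, so on a blocking descriptor the restart logic
itself never produces -EAGAIN/-EWOULDBLOCK.

    #include <unistd.h>
    #include <cerrno>

    // Retry read() when it is interrupted by a signal before transferring
    // any data; everything else (data, EOF, real errors) is passed through.
    ssize_t read_restart(int fd, void *buf, size_t len) {
        ssize_t n;
        do {
            n = ::read(fd, buf, len);
        } while (n < 0 && errno == EINTR);
        return n;
    }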

Comment 74 Jan Pokorný [poki] 2011-11-24 15:59:10 UTC
The patches attached to the cloned bug 742431 (RHEL 6) also apply:

* attachment 535963 [details]:
  bz742431: additional performance improvement patch [1/2]

* attachment 535964 [details]:
  bz742431: additional performance improvement patch [2/2]

Comment 80 Jan Pokorný [poki] 2011-11-28 21:19:49 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
* trigger unknown, presumably an uncommon event/attribute of the environment
Consequence
* outgoing queues in inter-node communication grow over time
Fix
* better balanced inter-node communication and restriction of the queues
Result
* resource utilization kept at a reasonable level
* possible queue interventions logged in /var/log/clumond.log

Comment 103 errata-xmlrpc 2012-02-21 06:49:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0292.html

Comment 104 Jan Pokorný [poki] 2015-06-22 16:25:54 UTC
*** Bug 1219866 has been marked as a duplicate of this bug. ***