Bug 728586 - Broker memory bloat due to leak of messages.
Summary: Broker memory bloat due to leak of messages.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: 2.0.10
Assignee: Ken Giusti
QA Contact: Leonid Zhaldybin
URL:
Whiteboard:
Depends On:
Blocks: 742040 744036 745231
 
Reported: 2011-08-05 17:35 UTC by Ken Giusti
Modified: 2014-11-09 22:38 UTC
CC List: 6 users

Fixed In Version: qpid-cpp-0.10-7.el6_1
Doc Type: Bug Fix
Doc Text:
A regression introduced a memory leak: the broker did not correctly release messages, so its memory footprint grew indefinitely. A patch has been provided to address this issue, and the memory leak no longer occurs in the described scenario.
Clone Of:
Environment:
Last Closed: 2011-10-24 11:43:16 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Bugzilla 698721 (urgent, CLOSED): durable messages transmission - performance issue (1.3.2 -> 2.0 candidate). Last updated: 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2011:1399 (normal, SHIPPED_LIVE): Red Hat Enterprise MRG Messaging 2.0 bug fix update. Last updated: 2011-10-24 11:43:08 UTC

Internal Links: 698721

Description Ken Giusti 2011-08-05 17:35:32 UTC
Description of problem:


The bugfix for:

https://bugzilla.redhat.com/show_bug.cgi?id=698721

introduced a memory leak: the broker does not correctly release messages, so its memory footprint grows indefinitely.

Comment 2 Leonid Zhaldybin 2011-10-04 11:42:19 UTC
I've tried to reproduce this memory leak on RHEL5 and RHEL6 using a python client which sends and receives durable messages. After running the test for four days, the results are as follows.
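
For illustration only, a minimal sketch of such a client using the python qpid.messaging API (queue name, payload size, and pacing are placeholders, not the exact parameters of the test that was run):

# Minimal sketch of a durable send/receive loop against a local broker.
# The queue name, payload size, and pacing are illustrative only.
import time
from qpid.messaging import Connection, Message

conn = Connection("localhost:5672")
conn.open()
try:
    session = conn.session()
    sender = session.sender("test-queue; {create: always, node: {durable: True}}")
    receiver = session.receiver("test-queue")
    while True:
        sender.send(Message(content="x" * 1024, durable=True))  # send a durable message
        session.acknowledge(receiver.fetch(timeout=10))          # consume and ack it
        time.sleep(0.01)                                          # pace the loop
finally:
    conn.close()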

RHEL5, i386:
The memory consumption of the qpidd process increased only slightly, by 8 KB (from 61444K to 61452K). According to the logged pmap output, the memory consumption did NOT change gradually; it went up in a single 8 KB "step" and does not seem to have grown any further since.

The list of packages:

python-qpid-0.10-1.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-0.10-8.el5
qpid-cpp-client-devel-0.10-8.el5
qpid-cpp-client-devel-docs-0.10-8.el5
qpid-cpp-client-ssl-0.10-8.el5
qpid-cpp-server-0.10-8.el5
qpid-cpp-server-cluster-0.10-8.el5
qpid-cpp-server-devel-0.10-8.el5
qpid-cpp-server-ssl-0.10-8.el5
qpid-cpp-server-store-0.10-8.el5
qpid-cpp-server-xml-0.10-8.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-qmf-devel-0.10-10.el5
qpid-tools-0.10-6.el5


RHEL5, x86_64:
Here the memory consumption increased significantly, from 223788K up to 289324K (by exactly 64 MB!). Again, from the pmap output it looks like one big "step", which occurred after the test had been running for about nine and a half hours; no further changes were observed from that moment on.


RHEL6, i386:
The memory consumption grew by more than a megabyte: 1304 KB (from 62540K to 63844K). There were two "steps", one of 280 KB and another of 1024 KB.

The list of packages:

python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-10.el6.i686
qpid-cpp-client-0.10-6.el6.i686
qpid-cpp-client-devel-0.10-6.el6.i686
qpid-cpp-client-devel-docs-0.10-6.el6.noarch
qpid-cpp-server-0.10-6.el6.i686
qpid-cpp-server-cluster-0.10-6.el6.i686
qpid-cpp-server-devel-0.10-6.el6.i686
qpid-cpp-server-store-0.10-6.el6.i686
qpid-cpp-server-xml-0.10-6.el6.i686
qpid-java-client-0.10-9.el6.noarch
qpid-java-common-0.10-9.el6.noarch
qpid-java-example-0.10-9.el6.noarch
qpid-qmf-0.10-10.el6.i686
qpid-tools-0.10-5.el6.noarch


RHEL6, x86_64:
The memory consumption increased significantly, by 16432 KB (more than 16 MB). It grew more or less gradually during the first ~30 hours, then stayed at the reached maximum of 356492K for the next three days.

Comment 3 Leonid Zhaldybin 2011-10-05 15:34:10 UTC
Tested on RHEL5 with newer packages:

python-qpid-0.10-1.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-client-devel-0.10-9.el5
qpid-cpp-client-devel-docs-0.10-9.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-cpp-server-0.10-9.el5
qpid-cpp-server-cluster-0.10-9.el5
qpid-cpp-server-devel-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
qpid-cpp-server-store-0.10-9.el5
qpid-cpp-server-xml-0.10-9.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-qmf-devel-0.10-10.el5
qpid-tools-0.10-6.el5


Results for the i386 architecture are as follows:
The test was run for 25 hours, and during that time the memory consumption increased by 1 MB, from 61124KB up to 62148KB (see the ver9 figures below).

Results for the x86_64 architecture:
Again, after the test had run for 25 hours, the memory consumption had increased by more than 65 MB, from 223893KB to 290472KB.

In short, the situation now looks significantly worse than with the previous version of the packages (see also my comment above):

------------------------------------------------------------
            |       i386               |    x86_64
------------------------------------------------------------
0.10-8      |       8KB (four days)    |    64MB (four days)
------------------------------------------------------------
0.10-9      |       1MB (25 hours)     |    65MB (25 hours)
------------------------------------------------------------

Commands used to extract the figures from the pmap logs (each pipeline takes the last line - pmap's "total" line - of every snapshot, keeps snapshots 6-15 for the "beginning" value or 153-162 for the "end" value, strips the trailing "K", and averages the result):
# ver8 beginning
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61444
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
223788
# ver8 end
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61452
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
289324
# ver9 beginning
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61124
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
223893
# ver9 end
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
62148
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
290472
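
The same averaging can be done with a short python helper (a sketch only; it assumes the pmap_ver*/pmap*.log layout above and that the last line of each log is pmap's "total" line):

# Average the pmap totals over a window of snapshots, mirroring the shell
# pipelines above: take the last ("total") line of every log, keep the
# snapshots in [start, end], strip the trailing "K", and average.
import glob

def average_total(pattern, start, end):
    totals = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            last_line = f.readlines()[-1]            # e.g. " total   61444K"
        totals.append(int(last_line.split()[1].rstrip("K")))
    window = totals[start - 1:end]                   # 1-based, inclusive
    return sum(window) / float(len(window))

# Snapshots 6-15 = "beginning", 153-162 = "end", as in the commands above.
print(average_total("pmap_ver8/pmap*.log", 6, 15))
print(average_total("pmap_ver8/pmap*.log", 153, 162))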


-> ASSIGNED

Comment 5 Ken Giusti 2011-10-06 17:39:57 UTC
This bug was raised due to a specific - very serious - error that prevented the broker from freeing durable messages until the sending session was closed.

The bug presented itself by causing the broker's memory use to grow linearly in relation to the _amount_ of message data that was sent through it.

For example, if you create a sender that passes 4 gigabytes of message data through the broker, and have a receiver that consumes all those messages, the bug made the broker keep holding those 4 gigabytes of messages even after they had been consumed.  The memory stayed reserved until the _sender_ closed its session.

The best way to reproduce this problem is to have a sender continually generate very large durable messages, and a simple consumer that pulls the messages off.  Since the messages are consumed at roughly the same rate as they are generated, the amount of memory used to hold them should stay fairly stable.  After generating several gigabytes of message traffic, stop the sender *but keep the sender's session alive* - do not close it.  Verify that the consumer has drained the queue.  Once the queue is drained, the broker's overall memory usage should be reasonable.

There should be a dramatic difference in broker memory usage when running that test pre- and post-fix.
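
A rough sketch of that check in python (qpid.messaging), with the queue name, payload size, and broker pid as placeholders; the essential point is that the sender's session is still open when the broker's resident memory is inspected:

# Sketch of the pre-/post-fix comparison described above.  Names, sizes and
# the broker pid are placeholders; the send/drain pattern is the test itself.
from qpid.messaging import Connection, Message

QPIDD_PID = 12345                                   # replace with the real qpidd pid
ADDR = "big-queue; {create: always, node: {durable: True}}"

def broker_rss_kb(pid):
    # Resident set size of the broker, in kB, read from /proc (Linux only).
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

send_conn = Connection("localhost:5672"); send_conn.open()
recv_conn = Connection("localhost:5672"); recv_conn.open()
send_session = send_conn.session()
recv_session = recv_conn.session()
sender = send_session.sender(ADDR)
receiver = recv_session.receiver("big-queue")

payload = "x" * (1024 * 1024)                       # 1 MB durable messages
for i in range(4096):                               # ~4 GB of message data
    sender.send(Message(content=payload, durable=True))
    recv_session.acknowledge(receiver.fetch(timeout=60))   # consume as we go

# Stop sending, but do NOT close the sender's connection or session.
# The queue is already drained; now look at the broker's memory.
print("broker RSS with sender session still open: %d kB" % broker_rss_kb(QPIDD_PID))
# Pre-fix, the ~4 GB of consumed message data stays resident until send_conn
# is closed; post-fix, the RSS should be back to a modest level.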

The memory growth described in the previous comments does not sound like the bug this BZ addresses: the behavior described there is much more gradual.  It should still be investigated, but I think it is a different issue.

Comment 7 Ken Giusti 2011-10-07 18:22:30 UTC
More specific detail on how to reproduce this bug:

This bug is a hard one to reproduce :) - but I think I've got a specific example that, at least on my particular laptop, forces it to occur.

First, I run "vmstat 5" in a window so I can monitor memory usage in real-time:

[kgiusti@localhost ~]$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0 1608824 2751516  34672 454152    7   13    41   138  164  140  7  2 89  1  0	
 4  0 1608800 2751996  34680 454080   26    0    31    27 5535 9243 20 17 62  1  0	
 0  1 1608588 2738068  34836 466936  126    0   155   207 5872 9727 26 12 57  4  0	
 2  1 1608456 2744804  34860 457296  107    0   226 18968 5595 9401 18 14 48 21  0	


Then I run a copy of the broker with the following options:

./qpidd --auth no --load-module /home/kgiusti/Desktop/work/store/cpp/lib/.libs/msgstore.so --store-dir ~/tmp/qpid --jfile-size 512 --num-jfiles 32 --log-enable info+

(be sure the --store-dir is empty - I'll do "rm -r ~/tmp/qpid/*" between runs)

Once that broker is running, run the following qpid-perftest against it:


./qpid-perftest -s --iterations 5 --count 500000 --durable yes --npubs 4 --qt 4 --nsubs 4

(qpid-perftest can be found in qpid/cpp/src/tests dir)

Watch the output from qpidd - after a minute or two you'll see the following log messages:

2011-10-07 14:09:06 info Queue "qpid-perftest1": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0
2011-10-07 14:09:06 notice Journal "qpid-perftest2": Created
2011-10-07 14:09:28 info Queue "qpid-perftest2": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0
2011-10-07 14:09:28 notice Journal "qpid-perftest3": Created
2011-10-07 14:09:50 info Queue "qpid-perftest3": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0

Once the log message regarding qpid-perftest3 appears, look back at the vmstat window. Within a minute or two, the amount of free memory will drop rapidly, and eventually the system will begin to swap:


procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1 1510832 2531244  40056 495316    6    0     8 47148 5634 9301 12  5 43 40  0	
 0  1 1510824 2529200  40080 495344   13    0    16 48444 5848 9650 17  5 37 41  0	
 6  1 1510824 2530864  40104 495372    0    0     3 48362 5706 9426 16  5 37 42  0	
 0  2 1510816 2531484  40136 495480    8    0    12 48391 5790 9490 17  5 39 39  0	
 0  1 1510808 2530120  40160 495488   13    0    15 44006 5808 9452 17  6 34 44  0	
 0  1 1510796 2530500  40180 495496   13    0    15 46772 5616 9421  9  5 43 44  0	
 0  1 1510784 2522248  40204 502744   13    0    19 48774 5655 9527 10  6 40 43  0	
 1  0 1510780 2404360  40228 495560    0    0     2 35649 8693 37923 35 14 26 25  0	
 4  1 1510736 2217740  40244 495512   18    0    19 21073 10986 67445 57 20  7 16  0	
 6  0 1510728 2023004  40252 495636   12    0    14 22019 11210 69135 58 22  8 12  0	
 3  0 1510720 1825876  40268 495632    6    0     8 22386 11515 70729 58 22  7 13  0	
 6  0 1510712 1627096  40276 495608   13    0    13 22479 12008 69410 50 21  9 20  0	
 6  0 1510708 1417164  40292 495620    0    0     2 23700 12622 72527 51 22  9 17  0	
 7  0 1510704 1225444  40308 495612    6    0     8 21696 11177 69014 56 22  7 15  0	
 6  2 1510696 1023572  40320 495592   13    0    13 22731 11649 70505 59 23  6 12  0	
 6  0 1510692 829884  40336 495688    6    0     8 21942 11524 70244 57 22  8 14  0	
 3  0 1510688 713184  40352 498532    0    0     2 16703 9936 49884 60 17  9 13  0	


My system is running Fedora 15 and has 4 GB of RAM.  Once that log message appears, all free memory is exhausted within 5 minutes and the system gets stuck in a swap storm until I manually kill the broker.
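
For unattended runs, a small watchdog along these lines (a sketch only; the 200 MB threshold and 5-second interval are arbitrary) can log free memory much like vmstat does and kill qpidd before the machine descends into the swap storm:

# Log MemFree/SwapFree every few seconds and kill qpidd if free memory drops
# below a threshold, so the reproduction does not hang the machine.
# The threshold and interval are arbitrary choices for this sketch.
import subprocess
import time

THRESHOLD_KB = 200 * 1024     # kill the broker below ~200 MB free
INTERVAL_SEC = 5

def meminfo(field):
    # Return a /proc/meminfo value in kB, e.g. meminfo("MemFree").
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])

while True:
    free_kb = meminfo("MemFree")
    swap_kb = meminfo("SwapFree")
    print("%s MemFree=%dkB SwapFree=%dkB" % (time.ctime(), free_kb, swap_kb))
    if free_kb < THRESHOLD_KB:
        subprocess.call(["pkill", "-f", "qpidd"])   # stop the leaking broker
        break
    time.sleep(INTERVAL_SEC)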

Comment 8 Leonid Zhaldybin 2011-10-09 07:02:59 UTC
The issue has been fixed. Tested on RHEL5.7 and RHEL6.1, i386 and x86_64, with the following packages:

RHEL5.7:
python-qpid-0.10-1.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-client-devel-0.10-9.el5
qpid-cpp-client-devel-docs-0.10-9.el5
qpid-cpp-client-rdma-0.10-9.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-cpp-mrg-debuginfo-0.10-9.el5
qpid-cpp-server-0.10-9.el5
qpid-cpp-server-cluster-0.10-9.el5
qpid-cpp-server-devel-0.10-9.el5
qpid-cpp-server-rdma-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
qpid-cpp-server-store-0.10-9.el5
qpid-cpp-server-xml-0.10-9.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5

RHEL6.1:
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-10.el6
qpid-cpp-client-0.10-7.el6_1
qpid-cpp-client-devel-0.10-7.el6_1
qpid-cpp-client-devel-docs-0.10-7.el6_1
qpid-cpp-client-rdma-0.10-7.el6_1
qpid-cpp-client-ssl-0.10-7.el6_1
qpid-cpp-server-0.10-7.el6_1
qpid-cpp-server-cluster-0.10-7.el6_1
qpid-cpp-server-devel-0.10-7.el6_1
qpid-cpp-server-rdma-0.10-7.el6_1
qpid-cpp-server-ssl-0.10-7.el6_1
qpid-cpp-server-store-0.10-7.el6_1
qpid-cpp-server-xml-0.10-7.el6_1
qpid-java-client-0.10-11.el6
qpid-java-common-0.10-11.el6
qpid-java-example-0.10-11.el6
qpid-qmf-0.10-10.el6
qpid-tests-0.10-1.el6
qpid-tools-0.10-5.el6
ruby-qpid-0.7.946106-2.el6
ruby-qpid-qmf-0.10-10.el6

-> VERIFIED

Comment 9 Tomas Capek 2011-10-17 14:04:24 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A regression introduced a memory leak: the broker did not correctly release messages, so its memory footprint grew indefinitely. A patch has been provided to address this issue, and the memory leak no longer occurs in the described scenario.

Comment 10 errata-xmlrpc 2011-10-24 11:43:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1399.html

