Bug 728586

Summary: Broker memory bloat due to leak of messages.
Product: Red Hat Enterprise MRG Reporter: Ken Giusti <kgiusti>
Component: qpid-cppAssignee: Ken Giusti <kgiusti>
Status: CLOSED ERRATA QA Contact: Leonid Zhaldybin <lzhaldyb>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 2.0CC: esammons, freznice, iboverma, jross, lzhaldyb, tross
Target Milestone: 2.0.10   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-0.10-7.el6_1 Doc Type: Bug Fix
Doc Text:
Due to a regression, a memory leak was introduced that prevented the broker from correctly releasing messages. Consequently, the broker's memory footprint grew indefinitely. A patch has been provided to address this issue, and the memory leak no longer occurs in the described scenario.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-10-24 11:43:16 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 742040, 744036, 745231    

Description Ken Giusti 2011-08-05 17:35:32 UTC
Description of problem:


The bugfix for:

https://bugzilla.redhat.com/show_bug.cgi?id=698721

introduced a memory leak that will prevent the broker from correctly releasing messages.  This results in the broker's memory footprint growing indefinitely.

Comment 2 Leonid Zhaldybin 2011-10-04 11:42:19 UTC
I've tried to reproduce this memory leak on RHEL5 and RHE6 using a python client which sends and receives durable messages. After running the test for four days, the results are as follows.

RHEL5, i386:
The memory consumption by the qpidd process increased slightly - by 8 KB to be precise (from 61444K to 61452K).  According to the logged pmap output, the memory consumption was NOT changing gradually, it looks like one "step" up by 8 KB. After that change, it doesn't seem to be growing any more.

The list of packages:

python-qpid-0.10-1.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-0.10-8.el5
qpid-cpp-client-devel-0.10-8.el5
qpid-cpp-client-devel-docs-0.10-8.el5
qpid-cpp-client-ssl-0.10-8.el5
qpid-cpp-server-0.10-8.el5
qpid-cpp-server-cluster-0.10-8.el5
qpid-cpp-server-devel-0.10-8.el5
qpid-cpp-server-ssl-0.10-8.el5
qpid-cpp-server-store-0.10-8.el5
qpid-cpp-server-xml-0.10-8.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-qmf-devel-0.10-10.el5
qpid-tools-0.10-6.el5


RHEL5, x86_64:
Here the memory consumption increased by a significant number, from 223788K up to 289324K (by 64 MB exactly!). Again, from the pmap output it looks like one big "step" which occurred after running the test for 9 and a half hours or so. No more changes observed starting from that moment.


RHEL6, i386:
The memory consumption growth is more than a megabyte - 1304 KB (from 62540K to 63844K). There were two "steps", one by 280 KB, another by 1024 KB.

The list of packages:

python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-10.el6.i686
qpid-cpp-client-0.10-6.el6.i686
qpid-cpp-client-devel-0.10-6.el6.i686
qpid-cpp-client-devel-docs-0.10-6.el6.noarch
qpid-cpp-server-0.10-6.el6.i686
qpid-cpp-server-cluster-0.10-6.el6.i686
qpid-cpp-server-devel-0.10-6.el6.i686
qpid-cpp-server-store-0.10-6.el6.i686
qpid-cpp-server-xml-0.10-6.el6.i686
qpid-java-client-0.10-9.el6.noarch
qpid-java-common-0.10-9.el6.noarch
qpid-java-example-0.10-9.el6.noarch
qpid-qmf-0.10-10.el6.i686
qpid-tools-0.10-5.el6.noarch


RHE6, x86_64:
The memory consumption increased significantly, by 16432 KB (more than 16 MB). It was growing (more or less) gradually during the first ~30 hours, and did not change from the reached maximum of 356492K during the next three days.

Comment 3 Leonid Zhaldybin 2011-10-05 15:34:10 UTC
Tested on RHEL5 with newer packages:

python-qpid-0.10-1.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-client-devel-0.10-9.el5
qpid-cpp-client-devel-docs-0.10-9.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-cpp-server-0.10-9.el5
qpid-cpp-server-cluster-0.10-9.el5
qpid-cpp-server-devel-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
qpid-cpp-server-store-0.10-9.el5
qpid-cpp-server-xml-0.10-9.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-qmf-devel-0.10-10.el5
qpid-tools-0.10-6.el5


Results for i386 architecture are as follows:
The test was run for 25 hours and during that time the memory consumption increased by 1MB, from 61444KB up to 62148.

Results for x86_64 architecture:
Again, after the test was run for 25 hours, the memory consumption increased by more than 65MB, from 223893KB to 290472KB.

In short, the situation now looks significantly worse than on previous version of packages (see also my comment above):

------------------------------------------------------------
            |       i386               |    x86_64
------------------------------------------------------------
0.10-8      |       8KB (four days)    |    64MB (four days)
------------------------------------------------------------
0.10-9      |       1MB (25 hours)     |    65MB (25 hours)
------------------------------------------------------------

Commands which were used to get info from pmap logs:
# ver8 beginning
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61444
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
223788
# ver8 end
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61452
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
289324
# ver9 beginning
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61124
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -15  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
223893
# ver9 end
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
62148
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -162  | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
290472


-> ASSIGNED

Comment 5 Ken Giusti 2011-10-06 17:39:57 UTC
This bug was raised due to a specific - very serious - error that prevented the broker from freeing up durable messages until the sending session is closed.

The bug presented itself by causing the broker's memory use to grow linearly in relation to the _amount_ of message data that was sent through it.

For example, if you create a sender that passes 4 gigabytes of message data through the broker, and have a receiver that consumes all those messages, the bug made the broker still hold those 4 gigabytes of messages even after they were consumed.  The bug kept the memory reserved until the _sender_ closes.

The best way to reproduce this problem would be to have a sender generate very large durable messages continually, and a simple consumer that pulls off the messages.  Since the messages are being consumed at roughly the same rate as they are being generated, the amount of memory used to hold these messages should be fairly stable.  After generating several gigabytes of message traffic, the sender should stop *but keep the sender's session alive* - do not close it.  Verify that the queue is drained by the consumer.  Once the queue is drained, the broker's overall memory usage should be reasonable.

There should be a dramatic difference in broker memory usage when running that test pre- and post-fix.

The memory growth described in the previous comments doesn't sound similar to the bug that this BZ addresses.  The described behavior sounds much more gradual than this target bug.  Though the described behavior should be investigated, I think it is a different issue.

Comment 7 Ken Giusti 2011-10-07 18:22:30 UTC
More specific detail on how-to reproduce this bug:

This bug is a hard one to reproduce :) - but I think I've got a specific example (that at least forces it to occur on my particular laptop)

First, I run "vmstat 5" in a window so I can monitor memory usage in real-time:

[kgiusti@localhost ~]$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0 1608824 2751516  34672 454152    7   13    41   138  164  140  7  2 89  1  0	
 4  0 1608800 2751996  34680 454080   26    0    31    27 5535 9243 20 17 62  1  0	
 0  1 1608588 2738068  34836 466936  126    0   155   207 5872 9727 26 12 57  4  0	
 2  1 1608456 2744804  34860 457296  107    0   226 18968 5595 9401 18 14 48 21  0	


Then I run a copy of the broker with the following options:

./qpidd --auth no --load-module /home/kgiusti/Desktop/work/store/cpp/lib/.libs/msgstore.so --store-dir ~/tmp/qpid --jfile-size 512 --num-jfiles 32 --log-enable info+

(be sure the --store-dir is empty - I'll do "rm -r ~/tmp/qpid/*" between runs)

Once that broker is running, run the following qpid-perftest against it:


./qpid-perftest -s --iterations 5 --count 500000 --durable yes --npubs 4 --qt 4 --nsubs 4

(qpid-perftest can be found in qpid/cpp/src/tests dir)

Watch the output from qpidd - after a minute or two you'll see the following log mesages:

2011-10-07 14:09:06 info Queue "qpid-perftest1": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0
2011-10-07 14:09:06 notice Journal "qpid-perftest2": Created
2011-10-07 14:09:28 info Queue "qpid-perftest2": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0
2011-10-07 14:09:28 notice Journal "qpid-perftest3": Created
2011-10-07 14:09:50 info Queue "qpid-perftest3": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0

Once that log msg regarding qpid-perftest3 appears, look back a the vmstat window. Within a minute or two, the amount of free memory will drop rapidly, and eventually the system will begin to swap:


procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1 1510832 2531244  40056 495316    6    0     8 47148 5634 9301 12  5 43 40  0	
 0  1 1510824 2529200  40080 495344   13    0    16 48444 5848 9650 17  5 37 41  0	
 6  1 1510824 2530864  40104 495372    0    0     3 48362 5706 9426 16  5 37 42  0	
 0  2 1510816 2531484  40136 495480    8    0    12 48391 5790 9490 17  5 39 39  0	
 0  1 1510808 2530120  40160 495488   13    0    15 44006 5808 9452 17  6 34 44  0	
 0  1 1510796 2530500  40180 495496   13    0    15 46772 5616 9421  9  5 43 44  0	
 0  1 1510784 2522248  40204 502744   13    0    19 48774 5655 9527 10  6 40 43  0	
 1  0 1510780 2404360  40228 495560    0    0     2 35649 8693 37923 35 14 26 25  0	
 4  1 1510736 2217740  40244 495512   18    0    19 21073 10986 67445 57 20  7 16  0	
 6  0 1510728 2023004  40252 495636   12    0    14 22019 11210 69135 58 22  8 12  0	
 3  0 1510720 1825876  40268 495632    6    0     8 22386 11515 70729 58 22  7 13  0	
 6  0 1510712 1627096  40276 495608   13    0    13 22479 12008 69410 50 21  9 20  0	
 6  0 1510708 1417164  40292 495620    0    0     2 23700 12622 72527 51 22  9 17  0	
 7  0 1510704 1225444  40308 495612    6    0     8 21696 11177 69014 56 22  7 15  0	
 6  2 1510696 1023572  40320 495592   13    0    13 22731 11649 70505 59 23  6 12  0	
 6  0 1510692 829884  40336 495688    6    0     8 21942 11524 70244 57 22  8 14  0	
 3  0 1510688 713184  40352 498532    0    0     2 16703 9936 49884 60 17  9 13  0	


My system is running fedora 15, and has 4 gig of DRAM.  Once that log message appears, all free memory is exhausted within 5 minutes and the system get stuck in a swap storm until I manually kill the broker.

Comment 8 Leonid Zhaldybin 2011-10-09 07:02:59 UTC
The issue has been fixed. Tested on RHEL5.7 / RHEL6.1  i386 / x86_64 on
packages:

RHEL5.7:
python-qpid-0.10-1.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-client-devel-0.10-9.el5
qpid-cpp-client-devel-docs-0.10-9.el5
qpid-cpp-client-rdma-0.10-9.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-cpp-mrg-debuginfo-0.10-9.el5
qpid-cpp-server-0.10-9.el5
qpid-cpp-server-cluster-0.10-9.el5
qpid-cpp-server-devel-0.10-9.el5
qpid-cpp-server-rdma-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
qpid-cpp-server-store-0.10-9.el5
qpid-cpp-server-xml-0.10-9.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5

RHEL6.1:
python-qpid-0.10-1.el6.noarch                                                                                                                                                                        
python-qpid-qmf-0.10-10.el6
qpid-cpp-client-0.10-7.el6_1
qpid-cpp-client-devel-0.10-7.el6_1
qpid-cpp-client-devel-docs-0.10-7.el6_1
qpid-cpp-client-rdma-0.10-7.el6_1
qpid-cpp-client-ssl-0.10-7.el6_1
qpid-cpp-server-0.10-7.el6_1
qpid-cpp-server-cluster-0.10-7.el6_1
qpid-cpp-server-devel-0.10-7.el6_1
qpid-cpp-server-rdma-0.10-7.el6_1
qpid-cpp-server-ssl-0.10-7.el6_1
qpid-cpp-server-store-0.10-7.el6_1
qpid-cpp-server-xml-0.10-7.el6_1
qpid-java-client-0.10-11.el6
qpid-java-common-0.10-11.el6
qpid-java-example-0.10-11.el6
qpid-qmf-0.10-10.el6
qpid-tests-0.10-1.el6
qpid-tools-0.10-5.el6
ruby-qpid-0.7.946106-2.el6
ruby-qpid-qmf-0.10-10.el6

-> VERIFIED

Comment 9 Tomas Capek 2011-10-17 14:04:24 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Due to a regression, a memory leak was introduced that prevented the broker from correctly releasing messages. Consequently, the broker's memory footprint grew indefinitely. A patch has been provided to address this issue, and the memory leak no longer occurs in the described scenario.

Comment 10 errata-xmlrpc 2011-10-24 11:43:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1399.html