Bug 728586
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Broker memory bloat due to leak of messages. | | |
| Product: | Red Hat Enterprise MRG | Reporter: | Ken Giusti <kgiusti> |
| Component: | qpid-cpp | Assignee: | Ken Giusti <kgiusti> |
| Status: | CLOSED ERRATA | QA Contact: | Leonid Zhaldybin <lzhaldyb> |
| Severity: | urgent | Priority: | urgent |
| Version: | 2.0 | CC: | esammons, freznice, iboverma, jross, lzhaldyb, tross |
| Target Milestone: | 2.0.10 | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | qpid-cpp-0.10-7.el6_1 | Doc Type: | Bug Fix |
| Last Closed: | 2011-10-24 11:43:16 UTC | Bug Blocks: | 742040, 744036, 745231 |

Doc Text:

> Due to a regression, a memory leak was introduced that prevented the broker from correctly releasing messages. Consequently, the broker's memory footprint grew indefinitely. A patch has been provided to address this issue, and the memory leak no longer occurs in the described scenario.
Description

Ken Giusti, 2011-08-05 17:35:32 UTC
Fix submitted upstream:

- https://issues.apache.org/jira/browse/QPID-3394
- http://svn.apache.org/viewvc?view=rev&rev=1153966

I've tried to reproduce this memory leak on RHEL5 and RHEL6 using a python client which sends and receives durable messages. After running the test for four days, the results are as follows.

RHEL5, i386: The memory consumption of the qpidd process increased slightly, by 8 KB to be precise (from 61444K to 61452K). According to the logged pmap output, the memory consumption was NOT changing gradually; it looks like a single "step" up by 8 KB, after which it does not seem to grow any more.

The list of packages:

```
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-0.10-8.el5
qpid-cpp-client-devel-0.10-8.el5
qpid-cpp-client-devel-docs-0.10-8.el5
qpid-cpp-client-ssl-0.10-8.el5
qpid-cpp-server-0.10-8.el5
qpid-cpp-server-cluster-0.10-8.el5
qpid-cpp-server-devel-0.10-8.el5
qpid-cpp-server-ssl-0.10-8.el5
qpid-cpp-server-store-0.10-8.el5
qpid-cpp-server-xml-0.10-8.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-qmf-devel-0.10-10.el5
qpid-tools-0.10-6.el5
```

RHEL5, x86_64: Here the memory consumption increased by a significant amount, from 223788K up to 289324K (by 64 MB exactly!). Again, from the pmap output it looks like one big "step", which occurred after running the test for about nine and a half hours. No further changes were observed from that moment on.

RHEL6, i386: The memory consumption growth was more than a megabyte, 1304 KB (from 62540K to 63844K). There were two "steps": one of 280 KB, another of 1024 KB.
The list of packages:

```
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-10.el6.i686
qpid-cpp-client-0.10-6.el6.i686
qpid-cpp-client-devel-0.10-6.el6.i686
qpid-cpp-client-devel-docs-0.10-6.el6.noarch
qpid-cpp-server-0.10-6.el6.i686
qpid-cpp-server-cluster-0.10-6.el6.i686
qpid-cpp-server-devel-0.10-6.el6.i686
qpid-cpp-server-store-0.10-6.el6.i686
qpid-cpp-server-xml-0.10-6.el6.i686
qpid-java-client-0.10-9.el6.noarch
qpid-java-common-0.10-9.el6.noarch
qpid-java-example-0.10-9.el6.noarch
qpid-qmf-0.10-10.el6.i686
qpid-tools-0.10-5.el6.noarch
```

RHEL6, x86_64: The memory consumption increased significantly, by 16432 KB (more than 16 MB). It grew (more or less) gradually during the first ~30 hours, and did not change from the reached maximum of 356492K during the next three days.

Tested on RHEL5 with newer packages:

```
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-client-devel-0.10-9.el5
qpid-cpp-client-devel-docs-0.10-9.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-cpp-server-0.10-9.el5
qpid-cpp-server-cluster-0.10-9.el5
qpid-cpp-server-devel-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
qpid-cpp-server-store-0.10-9.el5
qpid-cpp-server-xml-0.10-9.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-qmf-devel-0.10-10.el5
qpid-tools-0.10-6.el5
```

Results for the i386 architecture are as follows: the test was run for 25 hours, during which the memory consumption increased by 1 MB, from 61124KB up to 62148KB.

Results for the x86_64 architecture: again, after the test was run for 25 hours, the memory consumption increased by more than 65 MB, from 223893KB to 290472KB.
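The pmap logs analyzed above can be collected with a small helper along these lines. This is only a sketch: the `pidof qpidd` lookup, the five-minute interval, and the log path are assumptions about the test setup, so the runnable demo at the end snapshots the current shell instead of a broker.

```shell
#!/bin/sh
# Append one pmap snapshot of a process to a log file; the last line of
# each snapshot carries the "total" memory figure used in the analysis.
snapshot() {
    pid="$1"; log="$2"
    pmap "$pid" >> "$log"
}

# Against a running broker one would loop, e.g. (hypothetical setup):
#   while true; do snapshot "$(pidof qpidd)" pmap_qpidd.log; sleep 300; done

# Self-contained demo: snapshot this shell once and show the total line.
snapshot "$$" /tmp/pmap_demo_shell.log
tail -1 /tmp/pmap_demo_shell.log
```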
In short, the situation now looks significantly worse than on the previous version of the packages (see also my comment above):

| | i386 | x86_64 |
|---|---|---|
| 0.10-8 | 8 KB (four days) | 64 MB (four days) |
| 0.10-9 | 1 MB (25 hours) | 65 MB (25 hours) |

Commands which were used to get the figures from the pmap logs:

```
# ver8 beginning
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -15 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61444
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -15 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
223788
# ver8 end
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -162 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61452
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver8/pmap*.log); do tail -1 $i; done | head -162 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
289324
# ver9 beginning
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -15 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
61124
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -15 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
223893
# ver9 end
[root@lzhaldyb-rhel57i bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -162 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
62148
[root@lzhaldyb-rhel57x bz728586]# for i in $(ls pmap_ver9/pmap*.log); do tail -1 $i; done | head -162 | tail -10 | awk '{print substr($2,1,length($2)-1)}' | awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 +NR))}'
290472
```

-> ASSIGNED

This bug was raised due to a specific, very serious error that prevented the broker from freeing durable messages until the sending session was closed. The bug presented itself as broker memory use growing linearly with the _amount_ of message data sent through it. For example, if a sender passes 4 gigabytes of message data through the broker, and a receiver consumes all of those messages, the bug caused the broker to keep holding those 4 gigabytes even after the messages were consumed. The memory stayed reserved until the _sender_ closed.

The best way to reproduce this problem is to have a sender continually generate very large durable messages, and a simple consumer that pulls the messages off. Since the messages are consumed at roughly the same rate as they are generated, the amount of memory used to hold them should be fairly stable. After generating several gigabytes of message traffic, stop the sender *but keep the sender's session alive* - do not close it. Verify that the queue is drained by the consumer. Once the queue is drained, the broker's overall memory usage should be reasonable. There should be a dramatic difference in broker memory usage when running this test pre- and post-fix.

The memory growth described in the previous comments doesn't sound similar to the bug that this BZ addresses. The described behavior is much more gradual than the target bug. Though it should be investigated, I think it is a different issue.
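The averaging pipeline used to read the pmap logs can be exercised on synthetic input. The sketch below fabricates three fake log files (names and values are made up) whose final line mimics a pmap total, then runs the same tail/awk shape as above:

```shell
#!/bin/sh
# Create three fake pmap-style logs; as with real pmap output, the last
# line holds the total, with a trailing "K" on the size field (field 2).
mkdir -p /tmp/pmap_demo
for n in 1 2 3; do
    printf 'some mapping lines\ntotal %sK\n' "$((61000 + n * 100))" \
        > "/tmp/pmap_demo/pmap_$n.log"
done

# Same shape as the pipeline above: take the last line of every log,
# strip the trailing "K" from field 2, then average the numbers.
for i in /tmp/pmap_demo/pmap_*.log; do tail -1 "$i"; done |
    awk '{print substr($2,1,length($2)-1)}' |
    awk 'BEGIN{s=0}{s=s+($1+0)}END{print (s / (0.0 + NR))}'
# prints 61200 (the average of 61100, 61200, 61300)
```

The real commands add `head -15 | tail -10` (or `head -162 | tail -10`) between the `tail -1` loop and awk to average a ten-sample window at the beginning or end of the run.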
More specific detail on how to reproduce this bug. This bug is a hard one to reproduce :) - but I think I've got a specific example (that at least forces it to occur on my particular laptop).

First, I run "vmstat 5" in a window so I can monitor memory usage in real time:

```
[kgiusti@localhost ~]$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b    swpd    free    buff  cache   si  so   bi    bo    in    cs us sy id wa st
 4  0 1608824 2751516  34672 454152    7  13   41   138   164   140  7  2 89  1  0
 4  0 1608800 2751996  34680 454080   26   0   31    27  5535  9243 20 17 62  1  0
 0  1 1608588 2738068  34836 466936  126   0  155   207  5872  9727 26 12 57  4  0
 2  1 1608456 2744804  34860 457296  107   0  226 18968  5595  9401 18 14 48 21  0
```

Then I run a copy of the broker with the following options (be sure the --store-dir is empty - I do "rm -r ~/tmp/qpid/*" between runs):

```
./qpidd --auth no --load-module /home/kgiusti/Desktop/work/store/cpp/lib/.libs/msgstore.so --store-dir ~/tmp/qpid --jfile-size 512 --num-jfiles 32 --log-enable info+
```

Once that broker is running, run the following qpid-perftest against it (qpid-perftest can be found in the qpid/cpp/src/tests dir):

```
./qpid-perftest -s --iterations 5 --count 500000 --durable yes --npubs 4 --qt 4 --nsubs 4
```

Watch the output from qpidd - after a minute or two you'll see the following log messages:

```
2011-10-07 14:09:06 info Queue "qpid-perftest1": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0
2011-10-07 14:09:06 notice Journal "qpid-perftest2": Created
2011-10-07 14:09:28 info Queue "qpid-perftest2": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0
2011-10-07 14:09:28 notice Journal "qpid-perftest3": Created
2011-10-07 14:09:50 info Queue "qpid-perftest3": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=0, flowResumeSize=0
```

Once the log message regarding qpid-perftest3 appears, look back at the vmstat window.
Within a minute or two, the amount of free memory will drop rapidly, and eventually the system will begin to swap:

```
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b    swpd    free    buff  cache   si  so  bi    bo    in    cs us sy id wa st
 3  1 1510832 2531244  40056 495316    6   0   8 47148  5634  9301 12  5 43 40  0
 0  1 1510824 2529200  40080 495344   13   0  16 48444  5848  9650 17  5 37 41  0
 6  1 1510824 2530864  40104 495372    0   0   3 48362  5706  9426 16  5 37 42  0
 0  2 1510816 2531484  40136 495480    8   0  12 48391  5790  9490 17  5 39 39  0
 0  1 1510808 2530120  40160 495488   13   0  15 44006  5808  9452 17  6 34 44  0
 0  1 1510796 2530500  40180 495496   13   0  15 46772  5616  9421  9  5 43 44  0
 0  1 1510784 2522248  40204 502744   13   0  19 48774  5655  9527 10  6 40 43  0
 1  0 1510780 2404360  40228 495560    0   0   2 35649  8693 37923 35 14 26 25  0
 4  1 1510736 2217740  40244 495512   18   0  19 21073 10986 67445 57 20  7 16  0
 6  0 1510728 2023004  40252 495636   12   0  14 22019 11210 69135 58 22  8 12  0
 3  0 1510720 1825876  40268 495632    6   0   8 22386 11515 70729 58 22  7 13  0
 6  0 1510712 1627096  40276 495608   13   0  13 22479 12008 69410 50 21  9 20  0
 6  0 1510708 1417164  40292 495620    0   0   2 23700 12622 72527 51 22  9 17  0
 7  0 1510704 1225444  40308 495612    6   0   8 21696 11177 69014 56 22  7 15  0
 6  2 1510696 1023572  40320 495592   13   0  13 22731 11649 70505 59 23  6 12  0
 6  0 1510692  829884  40336 495688    6   0   8 21942 11524 70244 57 22  8 14  0
 3  0 1510688  713184  40352 498532    0   0   2 16703  9936 49884 60 17  9 13  0
```

My system is running Fedora 15 and has 4 GB of DRAM. Once that log message appears, all free memory is exhausted within 5 minutes and the system gets stuck in a swap storm until I manually kill the broker.

The issue has been fixed.
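To watch just the broker rather than system-wide vmstat figures, the process's resident set can be sampled straight from /proc. This is a sketch; `pidof qpidd` assumes a running broker, so the demo line samples the current shell instead:

```shell
#!/bin/sh
# Print a process's resident set size in kB, as reported by the kernel
# via the VmRSS field of /proc/<pid>/status.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Against a real broker (hypothetical): rss_kb "$(pidof qpidd)"
# Self-contained demo on the current shell:
rss_kb "$$"
```

Logging this value once a minute alongside the reproduction steps above makes the leak's linear growth easy to plot.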
Tested on RHEL5.7 / RHEL6.1, i386 / x86_64, with the following packages.

RHEL5.7:

```
python-qpid-0.10-1.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-client-devel-0.10-9.el5
qpid-cpp-client-devel-docs-0.10-9.el5
qpid-cpp-client-rdma-0.10-9.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-cpp-mrg-debuginfo-0.10-9.el5
qpid-cpp-server-0.10-9.el5
qpid-cpp-server-cluster-0.10-9.el5
qpid-cpp-server-devel-0.10-9.el5
qpid-cpp-server-rdma-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
qpid-cpp-server-store-0.10-9.el5
qpid-cpp-server-xml-0.10-9.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5
```

RHEL6.1:

```
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-10.el6
qpid-cpp-client-0.10-7.el6_1
qpid-cpp-client-devel-0.10-7.el6_1
qpid-cpp-client-devel-docs-0.10-7.el6_1
qpid-cpp-client-rdma-0.10-7.el6_1
qpid-cpp-client-ssl-0.10-7.el6_1
qpid-cpp-server-0.10-7.el6_1
qpid-cpp-server-cluster-0.10-7.el6_1
qpid-cpp-server-devel-0.10-7.el6_1
qpid-cpp-server-rdma-0.10-7.el6_1
qpid-cpp-server-ssl-0.10-7.el6_1
qpid-cpp-server-store-0.10-7.el6_1
qpid-cpp-server-xml-0.10-7.el6_1
qpid-java-client-0.10-11.el6
qpid-java-common-0.10-11.el6
qpid-java-example-0.10-11.el6
qpid-qmf-0.10-10.el6
qpid-tests-0.10-1.el6
qpid-tools-0.10-5.el6
ruby-qpid-0.7.946106-2.el6
ruby-qpid-qmf-0.10-10.el6
```

-> VERIFIED

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Due to a regression, a memory leak was introduced that prevented the broker from correctly releasing messages. Consequently, the broker's memory footprint grew indefinitely. A patch has been provided to address this issue, and the memory leak no longer occurs in the described scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1399.html