Bug 1609230
| Summary: | memory leak on 1.36 on randomly overwriting priority messages in ring prio queue | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Pavel Moravec <pmoravec> |
| Component: | qpid-cpp | Assignee: | Mike Cressman <mcressma> |
| Status: | CLOSED ERRATA | QA Contact: | Zdenek Kraus <zkraus> |
| Severity: | high | Docs Contact: | |
| Priority: | high | CC: | aconway, gsim, jross, mcressma, pmoravec, zkraus |
| Version: | 3.2 | Target Milestone: | 3.2.11 |
| Target Release: | --- | Hardware: | x86_64 |
| OS: | Linux | Whiteboard: | |
| Fixed In Version: | qpid-cpp-1.36.0-20 | Doc Type: | Bug Fix |
| Doc Text: | Cause: When a priority ring queue is filled to overflowing with messages of various priorities, and has no consumer but only a browsing client, Consequence: an internal data structure accumulates as the messages are delivered to the client, causing a slow memory leak. Fix: The internal data is now released when the message is delivered to the browsing client, Result: and the memory leak is no longer present. | | |
| Story Points: | --- | Clone Of: | |
| Environment: | | Last Closed: | 2018-09-11 17:20:33 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Attachments: | | | |
Description (Pavel Moravec, 2018-07-27 10:26:19 UTC)
This seems consistent with the analysis of bug 1609227: growth of a non-sparse MessageDeque, because it allocates space for all message IDs between the oldest high-priority message and the latest message ID. Even with random priorities, the queue will start to fill up with priority-10 messages that rarely get deleted; the queue would have to overflow with all priority-10 messages. Meanwhile the random lower-priority messages will cause the MessageDeque to grow.

Eventually, the queue should be full of the highest-priority messages, with all lower-priority messages deleted. At that point lower-priority messages will never be enqueued. New highest-priority messages will cause the oldest message to be removed, allowing the queue to be cleaned. So my hypothesis is that the largest the fifo index can grow to in this scenario is ring-size * priority levels. Am I wrong?

This does seem to me to be the case if I leave the reproducer running. Memory does indeed grow to begin with, but not indefinitely (though I haven't been running it all that long). I have seen the memory go down as well as up, and I have not yet seen RSS go above 120000.

(In reply to Gordon Sim from comment #2)
> Eventually, the queue should be full of the highest-priority messages, with
> all lower-priority messages deleted. At that point lower-priority messages
> will never be enqueued. New highest-priority messages will cause the oldest
> message to be removed, allowing the queue to be cleaned.
>
> So my hypothesis is that the largest the fifo index can grow to in this
> scenario is ring-size * priority levels. Am I wrong?

You are correct, I was too hasty. On longer runs under massif I see that the queue storage does grow for periods, then drops suddenly as an old high-priority message is dequeued. It doesn't grow without limit, although memory use is a bit of a roller coaster.

HOWEVER: something else is going on.
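As a sanity check on the ring-size * priority-levels hypothesis above, here is a toy simulation (hypothetical Python, not qpid-cpp code) of a ring priority queue. The assumed displacement rule matches the behaviour described in the thread: an overflowing enqueue evicts the oldest message at the lowest priority present, and is simply dropped if every queued message outranks it. The widest span of live sequence numbers is a proxy for the size of a non-sparse fifo index; it oscillates but stays in the same ballpark as ring-size * priority-levels rather than growing without bound.

```python
import random

def simulate(capacity, levels, n_messages, seed=1):
    """Toy ring priority queue: an overflowing enqueue evicts the oldest
    message at the lowest priority present, or is dropped if it outranks
    nothing.  Tracks the widest window of live sequence numbers."""
    rng = random.Random(seed)
    queue = []        # (sequence number, priority), in enqueue order
    max_span = 0
    for seq in range(n_messages):
        prio = rng.randrange(levels)
        if len(queue) == capacity:
            lowest = min(p for _, p in queue)
            if prio < lowest:
                continue          # everything queued outranks this message
            # evict the oldest message at the lowest priority present
            del queue[next(i for i, (_, p) in enumerate(queue) if p == lowest)]
        queue.append((seq, prio))
        # a non-sparse fifo index must cover oldest..newest live sequence
        max_span = max(max_span, queue[-1][0] - queue[0][0] + 1)
    return max_span, len(queue)

max_span, depth = simulate(capacity=20, levels=10, n_messages=50000)
print("max fifo-index span:", max_span, "vs ring-size * levels =", 20 * 10)
```

With a ring size of 20 and 10 priority levels the observed span hovers around 200, confirming the bounded-but-roller-coaster behaviour rather than an unbounded leak.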
Underneath the randomness, Massif shows perfectly linear growth from qpid::broker::SemanticState::record(qpid::broker::DeliveryRecord const&). So it looks like the browser is somehow generating a build-up of unacked records. I'll attach the massif diagram.

Created attachment 1474068 [details]
Massif diagram of qpidd memory during 1.6M message reproducer
The orange wedge at the bottom is the unacked growth.
Created attachment 1474069 [details]
Raw massif data for 1.6m message reproducer
Get massif-visualizer if you haven't already; it's the bomb!
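For intuition, the linear growth in SemanticState::record can be mimicked with a toy model (names hypothetical, not the actual qpid-cpp classes): the broker keeps one delivery record per unacknowledged delivery, but a browser never acknowledges, so each browsed message leaves a record behind unless the broker releases it as soon as the browse delivery completes, which is what the eventual fix does.

```python
class ToySession:
    """Stand-in for broker-side session state (names hypothetical)."""
    def __init__(self, release_browsed_records):
        # True models the fixed behaviour; False the leaky one
        self.release_browsed_records = release_browsed_records
        self.unacked = []   # delivery records awaiting acknowledgement

    def deliver(self, msg_id, browse):
        self.unacked.append((msg_id, "browsed" if browse else "acquired"))
        if browse and self.release_browsed_records:
            # a browsed message will never be acknowledged, so the record
            # can be dropped as soon as the delivery completes
            self.unacked.pop()

leaky = ToySession(release_browsed_records=False)
fixed = ToySession(release_browsed_records=True)
for i in range(100000):             # browse-only client, as in the reproducer
    leaky.deliver(i, browse=True)
    fixed.deliver(i, browse=True)
print(len(leaky.unacked), len(fixed.unacked))
```

The leaky session ends holding 100,000 records (the orange wedge in the massif plot); the fixed one holds none.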
I believe the accumulation of delivery records should be fixed by https://issues.apache.org/jira/browse/QPID-8226. Targeting for the upcoming MRG 3.2.11 release.

I have run Pavel's reproducer on RHEL7 64b and RHEL6 64b, and it works as expected. But on RHEL6 32b there seems to be a more rapid memory leak than the previous version showed on 64b.

qpid-cpp-server-1.36.0-20.el6_10.i686
glibc-2.12-1.212.el6.i686
kernel-2.6.32-754.el6.i686

Is there something specific I can measure for you?

Another thing: when I kill Pavel's reproducer, I cannot reuse the same broker again. I mean the same reproducer (same queue) does not run. After a restart it's OK. Again, only on 32b; 64b works as expected.

I have compared the speed of memory allocation between qpid-cpp-server-1.36.0-15 and -20 on RHEL6 i386, and the -20 build has a much higher allocation rate, ~50 MB in a few seconds of the reproducer running, whereas -15 has a rate of 10 MB per minute. So this looks like a regression for i386 -> ASSIGNED, FailedQA

(In reply to Zdenek Kraus from comment #9)
> I have run Pavel's reproducer on RHEL7 64b and RHEL6 64b, and it works as
> expected. But on RHEL6 32b there seems to be a more rapid memory leak than
> the previous version showed on 64b.
>
> qpid-cpp-server-1.36.0-20.el6_10.i686
> glibc-2.12-1.212.el6.i686
> kernel-2.6.32-754.el6.i686
>
> Is there something specific I can measure for you?
>
> Another thing: when I kill Pavel's reproducer, I cannot reuse the same
> broker again. I mean the same reproducer (same queue) does not run. After a
> restart it's OK. Again, only on 32b; 64b works as expected.

I can't reproduce this observation (memory leak on 1.36.0-20). My machine: rdma-dev-10.lab.bos.redhat.com (beaker's default root password). On 1.36.0-15, running the script showed memory growth of a few MB per minute. On 1.36.0-20, RSS was oscillating between 18 MB and 25 MB without any trend, for >30 minutes. Same qpid-cpp / glibc / kernel packages used.
So I have picked up another VM from the pool and I cannot reproduce it either, and I have lent the original VM to Mike, so I don't want to disturb that one.

It was a false negative: alternate-exchange settings in the reproducer and some leftover queues on the tested instance caused allocated memory to skyrocket. I've updated the reproducer and re-tested, and everything seems OK. Sorry for the false alarm, and thanks for checking.

The fix was tested on RHEL 6 i686 and x86_64, and RHEL 7 x86_64, with the following packages:

qpid-cpp-server-1.36.0-20

The fix works as expected -> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2680
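For anyone re-running the verification, broker RSS can be sampled without extra tooling. This Linux-only sketch (hypothetical helper, reading /proc) is the kind of loop used above to distinguish the healthy 18-25 MB oscillation from monotonic leak-style growth.

```python
import os
import re
import time

def rss_kb(pid):
    """Resident set size of `pid` in kB, read from /proc (Linux only)."""
    with open("/proc/%d/status" % pid) as status:
        return int(re.search(r"^VmRSS:\s+(\d+)\s+kB",
                             status.read(), re.MULTILINE).group(1))

def watch(pid, samples=30, interval=2.0):
    """Print periodic RSS samples: enough to tell an oscillation
    (bounded ring-queue churn) from steady growth (a leak)."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), rss_kb(pid), "kB")
        time.sleep(interval)
```

Usage: `watch(<pid of qpidd>)` while the reproducer runs; plotting the samples gives a cruder version of the massif roller-coaster view.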