Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1609230

Summary: memory leak in 1.36 when randomly overwriting priority messages in a ring prio queue
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: 3.2
Reporter: Pavel Moravec <pmoravec>
Assignee: Mike Cressman <mcressma>
QA Contact: Zdenek Kraus <zkraus>
CC: aconway, gsim, jross, mcressma, pmoravec, zkraus
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: 3.2.11
Hardware: x86_64
OS: Linux
Fixed In Version: qpid-cpp-1.36.0-20
Doc Type: Bug Fix
Doc Text:
Cause: A priority ring queue is filled to overflowing with messages of various priorities and has no consumer, only a browsing client.
Consequence: An internal data structure accumulates entries as messages are delivered to the client, causing a slow memory leak.
Fix: The internal data is now released when a message is delivered to the browsing client.
Result: The memory leak is no longer present.
Last Closed: 2018-09-11 17:20:33 UTC
Type: Bug
Attachments:
- Massif diagram of qpidd memory during 1.6M message reproducer
- Raw massif data for 1.6M message reproducer

Description Pavel Moravec 2018-07-27 10:26:19 UTC
Description of problem:
When finding a reproducer for bz1609227, I discovered a small but steady memory accumulation in one scenario: randomly sending priority messages to a priority ring queue that has a browser but no consumer.

This bug is not present in 0.18. It affects only the primary broker; backup brokers have a stable memory footprint.


Version-Release number of selected component (if applicable):
qpid-cpp-server-1.36.0-11.el7.x86_64


How reproducible:
100%


Steps to Reproduce:
- run this script:

# clean up any previous run
killall qpid-send qpid-receive

queue=RingPrio

# create the priority ring queue (1000 messages / 1 MB cap, ring policy, 10 priority levels)
qpid-receive -a "${queue}; {create:always, node:{ type:queue, x-bindings:[{exchange: 'amq.direct', queue: '${queue}', key:'${queue}'}], x-declare:{'alternate-exchange':'amq.fanout', arguments:{'qpid.max_count':1000, 'qpid.max_size':1000000, 'qpid.policy_type':'ring', 'x-qpid-priorities':10}}}}" --print-content=no --capacity=1 --receive-rate=1

# browse the queue forever (no consumer), discarding message content
qpid-receive -a "${queue}; {mode:browse}" -f --print-content=no &

# forever: send 100 messages at each priority 1-10 in parallel,
# then wait for all senders to finish
while true; do
	for i in $(seq 1 10); do
		qpid-send --priority=$i -m 100 -a "${queue}" &
	done
	while [ $(pgrep qpid-send | wc -l) -gt 0 ]; do
		sleep 0.1
	done
	sleep 0.01
done

- monitor the broker's memory footprint (RSS) over time
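The memory footprint can be watched with any tool that reports RSS; a minimal sampler (a sketch only — it assumes a Linux /proc filesystem and a broker process named qpidd) could look like:

```python
import os
import time

def rss_kb(pid):
    """Resident set size of `pid` in kB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])   # VmRSS is reported in kB
    return 0

if __name__ == "__main__":
    # Sample qpidd's RSS every 5 seconds until the broker goes away.
    pid_str = os.popen("pgrep -o qpidd").read().strip()
    while pid_str:
        try:
            print(time.strftime("%H:%M:%S"), rss_kb(int(pid_str)), "kB")
        except FileNotFoundError:
            break                             # broker exited
        time.sleep(5)
```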


Actual results:
memory grows by several MB per minute


Expected results:
no memory growth


Additional info:

Comment 1 Alan Conway 2018-08-07 15:55:10 UTC
This seems consistent with the analysis of bug 1609227: growth of a non-sparse MessageDeque, because it allocates space for all message IDs between the oldest high-priority message and the latest message ID.

Even with random priorities, the queue will start to fill up with priority-10 messages that rarely get deleted (a priority-10 message is only displaced once the ring overflows while holding nothing but priority-10 messages). Meanwhile, the random lower-priority messages cause the MessageDeque to grow.
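The effect can be illustrated with a toy model (an illustration only, not the qpid code): a dense FIFO index that keeps one slot for every sequence number from the oldest live message up to the newest, and can only shrink from the front. A single old pinned message keeps the whole index alive:

```python
# Toy dense FIFO index: one slot per sequence number >= base,
# released only from the front.
class DenseIndex:
    def __init__(self):
        self.base = 0          # sequence number of slot 0
        self.slots = []        # one entry per sequence number >= base
    def enqueue(self, seq):
        while self.base + len(self.slots) <= seq:
            self.slots.append(None)        # grow to cover seq
        self.slots[seq - self.base] = "msg"
    def dequeue(self, seq):
        self.slots[seq - self.base] = None
        # shrink only while the front slot is empty
        while self.slots and self.slots[0] is None:
            self.slots.pop(0)
            self.base += 1

idx = DenseIndex()
idx.enqueue(0)                 # one old high-priority message pins slot 0
for seq in range(1, 1001):     # lower-priority churn: enqueue then dequeue
    idx.enqueue(seq)
    idx.dequeue(seq)
print(len(idx.slots))          # → 1001 slots retained for 1 live message
```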

Comment 2 Gordon Sim 2018-08-07 17:30:10 UTC
Eventually, the queue should be full of the highest priority messages with all lower priority messages deleted. At that point lower priority messages will never be enqueued. New highest priority messages will cause the oldest message to be removed, allowing the queue to be cleaned.

So my hypothesis is that the largest that the fifo index can grow to in this scenario is ring-size * priority level. Am I wrong?

This does seem to be the case if I leave the reproducer running. Memory does indeed grow to begin with, but not indefinitely: though I haven't been running it all that long, I have seen the memory go down as well as up, and I have not yet seen RSS go above 120000.
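The hypothesized bound can be probed with a toy simulation (assumptions, simplifying the broker's actual ring policy: overflow evicts the oldest message of the lowest priority present, and an arrival that would itself be the lowest is dropped). The observed index span stays near ring-size × priority levels, far below the number of sends:

```python
import random

random.seed(1)
C, LEVELS, SENDS = 100, 10, 50_000
live = {}                 # seqno -> priority, the messages currently queued
seq = 0
max_span = 0              # widest stretch of live sequence numbers seen
for _ in range(SENDS):
    prio = random.randint(1, LEVELS)
    seq += 1              # every send consumes a sequence number
    if len(live) == C:
        lowest = min(live.values())
        if prio < lowest:
            continue      # incoming message would be the lowest: drop it
        victim = min(s for s, p in live.items() if p == lowest)
        del live[victim]  # evict oldest message of the lowest priority
    live[seq] = prio
    max_span = max(max_span, seq - min(live) + 1)
print(len(live), max_span)
```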

Comment 3 Alan Conway 2018-08-07 17:44:15 UTC
(In reply to Gordon Sim from comment #2)
> Eventually, the queue should be full of the highest priority messages with
> all lower priority messages deleted. At that point lower priority messages
> will never be enqueued. New highest priority messages will cause the oldest
> message to be removed, allowing the queue to be cleaned.
> 
> So my hypothesis is that the largest that the fifo index can grow to in this
> scenario is ring-size * priority level. Am I wrong?

You are correct, I was too hasty. On longer runs under Massif I see that the queue storage does grow for periods, then drops suddenly as an old high-priority message is dequeued. It doesn't grow without limit, although memory use is a bit of a roller coaster.

HOWEVER: something else is going on. Underneath the randomness, Massif shows perfectly linear growth from:

qpid::broker::SemanticState::record(qpid::broker::DeliveryRecord const&) 

So it looks like the browser is somehow generating a build-up of unacked records.

I'll attach the massif diagram.

Comment 4 Alan Conway 2018-08-07 17:46:54 UTC
Created attachment 1474068 [details]
Massif diagram of qpidd memory during 1.6M message reproducer

The orange wedge at the bottom is the unacked growth.

Comment 5 Alan Conway 2018-08-07 17:48:52 UTC
Created attachment 1474069 [details]
Raw massif data for 1.6m message reproducer

Get massif-visualizer if you haven't already, it's the bomb!

Comment 6 Gordon Sim 2018-08-07 21:28:05 UTC
I believe the accumulation of delivery records should be fixed by https://issues.apache.org/jira/browse/QPID-8226.
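The gist of that fix can be sketched with a toy model (an illustration under the assumption, per the JIRA, that browsers need no acknowledgements, so their delivery records can be released as soon as the message is delivered; `ToySession` is not qpid's actual API):

```python
class ToySession:
    """Stand-in for broker-side per-session state; not qpid code."""
    def __init__(self, release_browse_records):
        self.release_browse_records = release_browse_records
        self.unacked = []               # retained delivery records
    def deliver(self, seq, browsing):
        if browsing and self.release_browse_records:
            return                      # record released immediately: nothing kept
        self.unacked.append((seq, browsing))

before = ToySession(release_browse_records=False)   # pre-fix behaviour
after = ToySession(release_browse_records=True)     # post-fix behaviour
for seq in range(100_000):                          # browser pages through 100k messages
    before.deliver(seq, browsing=True)
    after.deliver(seq, browsing=True)
print(len(before.unacked), len(after.unacked))      # → 100000 0
```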

Comment 7 Mike Cressman 2018-08-13 13:53:27 UTC
Targeting for upcoming MRG 3.2.11 release.

Comment 9 Zdenek Kraus 2018-08-21 11:28:25 UTC
I have run Pavel's reproducer on RHEL7 64b and RHEL6 64b, and it works as expected. But on RHEL6 32b there seems to be a more rapid memory leak than the previous version showed on 64b.

qpid-cpp-server-1.36.0-20.el6_10.i686
glibc-2.12-1.212.el6.i686
kernel-2.6.32-754.el6.i686

Is there something specific I can measure for you ?

Another thing: when I kill Pavel's reproducer, I cannot reuse the same broker again, i.e. the same reproducer (same queue) does not run. After a restart it's OK. Again, only on 32b; 64b works as expected.

Comment 10 Zdenek Kraus 2018-08-22 11:48:10 UTC
I have compared the speed of memory allocation between qpid-cpp-server-1.36.0-15 and -20 on RHEL6 i386, and the -20 build has a much higher allocation rate: ~50 MB within a few seconds of the reproducer running, whereas -15 has a rate of ~10 MB per minute.

So this looks like a regression for i386.

-> ASSIGNED, FailedQA

Comment 11 Pavel Moravec 2018-08-30 20:33:50 UTC
(In reply to Zdenek Kraus from comment #9)
> I have run Pavel's reproducer on RHEL7 64b and RHEL6 64b, and it works as
> expected. But on RHEL6 32b there seems to be a more rapid memory leak than
> the previous version showed on 64b.
> 
> qpid-cpp-server-1.36.0-20.el6_10.i686
> glibc-2.12-1.212.el6.i686
> kernel-2.6.32-754.el6.i686
> 
> Is there something specific I can measure for you ?
> 
> Another thing: when I kill Pavel's reproducer, I cannot reuse the same
> broker again, i.e. the same reproducer (same queue) does not run. After a
> restart it's OK. Again, only on 32b; 64b works as expected.

I can't reproduce this observation (memory leak on 1.36.0-20).

My machine: rdma-dev-10.lab.bos.redhat.com (beaker's default root password)

On 1.36.0-15, running the script showed memory growth of a few MB per minute.

On 1.36.0-20, RSS was oscillating between 18 MB and 25 MB without any trend, for >30 minutes.

Same qpid-cpp / glibc / kernel packages used.

Comment 12 Zdenek Kraus 2018-08-31 10:04:37 UTC
So I have picked another VM from the pool and I cannot reproduce it either. I have lent the original VM to Mike, so I don't want to disturb that one.

Comment 13 Zdenek Kraus 2018-09-03 11:16:10 UTC
It was a false negative: the alternate-exchange setting in the reproducer and some leftover queues on the tested instance caused allocated memory to skyrocket.

I've updated the reproducer and re-tested and everything seems ok.

Sorry for the false alarm, and thanks for checking.



The fix was tested on RHEL 6 i686 and x86_64, and on RHEL 7 x86_64, with the following packages:

qpid-cpp-server-1.36.0-20

The fix works as expected.

-> VERIFIED

Comment 15 errata-xmlrpc 2018-09-11 17:20:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2680