Bug 1151269 - Qpid-ha broker memory bloat due to a memory leak with RDMA.
Summary: Qpid-ha broker memory bloat due to a memory leak with RDMA.
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: messaging-bugs
QA Contact: Messaging QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-10-10 01:57 UTC by euroford
Modified: 2021-03-03 23:07 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments
cluster.conf (1.15 KB, application/xml), attached 2014-10-10 14:46 UTC by euroford
qpidd.conf (1.09 KB, text/plain), attached 2014-10-10 14:47 UTC by euroford
valgrind log (1.62 MB, text/plain), attached 2014-12-01 09:32 UTC by euroford


Links
Red Hat Bugzilla 468932 (medium, CLOSED): Federation over Rdma links will leak memory for every link established. Last updated 2021-02-22 00:41:40 UTC.

Internal Links: 468932

Description euroford 2014-10-10 01:57:43 UTC
Description of problem:
The broker's memory bloats and it eventually crashes.

Version-Release number of selected component (if applicable):
qpid-ha-0.22-49.el6
kernel-rt-3.10.33-rt32.51.el6rt

How reproducible:


Steps to Reproduce:
1. qpid-receive -b 'amqp:rdma:qpid-ha-server-ip' -a "benchmark-0;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}" --connection-options '{tcp-nodelay:true,reconnect:true,heartbeat:1}' --print-content no -f
2. qpid-send -b 'amqp:rdma:qpid-ha-server-ip' -a "benchmark-0;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}" --messages 1000 --content-size 5120000 --send-rate 300 --report-total --report-header=no --timestamp=yes --sequence=no --durable False --connection-options '{tcp-nodelay:true,reconnect:true,heartbeat:1}'
3. Wait until the broker memory bloats (see the monitoring sketch below).
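A minimal way to watch the broker's memory during step 3; a sketch that assumes a single qpidd process on the HA node:

# sample the qpidd resident set size (RSS) and virtual size (VSZ), in KB, every 10 seconds
while true; do
    date -u
    ps -o rss=,vsz= -p "$(pgrep -x qpidd)"
    sleep 10
done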

Actual results:
Broker memory bloats and the broker eventually crashes.

Expected results:
The broker keeps working.

Additional info:

Comment 1 Justin Ross 2014-10-10 11:08:37 UTC
Alan, please assess.

Comment 2 euroford 2014-10-10 14:36:02 UTC
In an FDR InfiniBand environment, qpid-ha crashes within 10 minutes when sending 5 MB messages at 300 Hz.

The bigger the message, the faster qpid-ha crashes.

My server has 64 GB of RAM; I think qpid-ha will crash faster on servers with less RAM.
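For scale: at --content-size 5120000 and --send-rate 300, the offered payload is about 5,120,000 bytes x 300 msg/s, roughly 1.5 GB/s, so anything the broker leaks per message or per buffer adds up within minutes.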

Comment 3 euroford 2014-10-10 14:46:17 UTC
Created attachment 945661 [details]
cluster.conf

Comment 4 euroford 2014-10-10 14:47:31 UTC
Created attachment 945663 [details]
qpidd.conf

Comment 5 Alan Conway 2014-10-21 14:44:15 UTC
I tried to reproduce this with TCP. Running overnight, I saw memory grow from 350M to 650M, but it appeared to stabilize at that point. This is nothing like the growth reported above. RDMA is likely a factor; further investigation with RDMA is required.
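One way to tell transient buffer growth that later stabilizes apart from a genuine leak is to run the broker under valgrind's massif heap profiler. A minimal sketch, assuming qpidd is started in the foreground and that the broker config lives at /etc/qpid/qpidd.conf (both assumptions, not taken from this report):

valgrind --tool=massif --massif-out-file=qpidd.massif qpidd --config /etc/qpid/qpidd.conf
# after stopping the broker, summarize the heap snapshots
ms_print qpidd.massif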

Comment 6 euroford 2014-10-27 14:03:37 UTC
When qpid-ha runs in an IB environment, it receives messages in RDMA mode and relays them to the replica server in an unbalanced way, without any flow control; I think that may be the reason.
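If the relay to the backup really runs without flow control, the backlog should show up as a growing message depth on the primary's queues. A quick check, run on the primary broker host; a sketch assuming qpid-tools is installed and the broker listens on the default local port:

# list queues with their message depths; refresh repeatedly to see whether depth keeps growing
watch -n 5 qpid-stat -q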

Comment 7 Andrew Stitcher 2014-10-27 14:46:04 UTC
Perhaps related is this long-standing RDMA federation problem:
Bug 468932.

Comment 8 Alan Conway 2014-10-27 16:08:02 UTC
HA does use federation links, but Bug 468932 says the leak is per link. euroford, does your test involve a lot of broker failures/disconnects? There would be a new link for each reconnect. From the description it sounds like you are just running the sender/receiver against a static cluster with no failures, in which case a per-link leak wouldn't explain the bloat. Even so, there may still be a relationship between the bugs, but that would require either another way of triggering the leak or something else causing a lot of reconnects.
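To rule reconnect churn in or out, one option is to watch the broker's connection list and check whether entries (including the inter-broker links) keep being replaced. A sketch, again assuming qpid-tools on the broker host:

# connections currently open on the local broker; new entries on every refresh would indicate reconnects
watch -n 5 qpid-stat -c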

Comment 9 euroford 2014-10-31 03:42:17 UTC
Hi Alan,
Yes, I just run the sender/receiver against a two-node qpid-ha cluster; the steps and configuration files are all posted here.
Could you reproduce this bug in an IB environment?

Comment 10 Alan Conway 2014-10-31 14:44:13 UTC
OK, that's what I thought. So on the surface this does not look like it is directly caused by Bug 468932, but there may be some connection. I haven't had a chance to investigate on IB yet; I will update this bug as soon as I do.

Comment 11 euroford 2014-12-01 09:32:19 UTC
Created attachment 963194 [details]
valgrind log

I'm sure this bug is RDMA-related.
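For reference, a leak report of this kind can be produced along these lines; a sketch in which the config path and log file name are assumptions, not taken from the attachment:

valgrind --leak-check=full --show-reachable=yes --log-file=qpidd-valgrind.log \
    qpidd --config /etc/qpid/qpidd.conf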

Comment 12 euroford 2014-12-01 09:42:31 UTC
My hardware environment is an IBM Flex System x240 with a Mellanox ConnectX-3 56 Gb/s dual-port IB card, using libibverbs-rocee-1.1.7-1.1.el6_5 and libmlx4-rocee-1.0.5-1.1.el6_5 from the MRG repo.
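For completeness, the adapter and active link rate on such a setup can be confirmed with the libibverbs utilities; a sketch assuming libibverbs-utils (or the rocee equivalent) is installed:

# prints each IB/RoCE device (e.g. mlx4_0), its port state, and the active link rate
ibv_devinfo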

