Description of problem:
Broker memory bloats and the broker eventually crashes.

Version-Release number of selected component (if applicable):
qpid-ha-0.22-49.el6
kernel-rt-3.10.33-rt32.51.el6rt

How reproducible:

Steps to Reproduce:
1. qpid-receive -b 'amqp:rdma:qpid-ha-server-ip' -a 'benchmark-0;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}' --connection-options '{tcp-nodelay:true,reconnect:true,heartbeat:1}' --print-content no -f
2. qpid-send -b 'amqp:rdma:qpid-ha-server-ip' -a 'benchmark-0;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}' --messages 1000 --content-size 5120000 --send-rate 300 --report-total --report-header=no --timestamp=yes --sequence=no --durable False --connection-options '{tcp-nodelay:true,reconnect:true,heartbeat:1}'
3. Wait until the broker memory bloats.

Actual results:
Broker memory bloats and the broker eventually crashes.

Expected results:
The broker keeps working.

Additional info:
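For step 3, a simple way to watch the broker's resident memory while the test runs (a sketch, assuming the broker process is named qpidd and standard procps tools are available):

  # Print the qpidd RSS (in KB) every 10 seconds
  while true; do ps -o rss= -C qpidd; sleep 10; done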
Alan, please assess.
In an FDR IB environment, qpid-ha crashes within 10 minutes when sending 5 MB messages at 300 Hz. The bigger the message, the faster qpid-ha crashes. My server has 64 GB of RAM; I think qpid-ha would crash even faster on servers with less RAM.
Created attachment 945661 [details] cluster.conf
Created attachment 945663 [details] qpidd.conf
I tried to reproduce this with TCP. Running overnight, I saw memory grow from 350M to 650M, but it appears to stabilize at that point. This is nothing like the growth reported above. It seems likely that RDMA is a factor; further investigation with RDMA is required.
When qpid-ha works in an IB environment, it receives messages in RDMA mode and relays them to the replica server in an unbalanced way without any flow control; I think that may be the reason.
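If that is the case, one possible diagnostic (not a fix) would be to bound the test queue and see whether the growth persists. qpid.max_size and qpid.policy_type are standard C++ broker queue arguments, but whether such limits also constrain the HA replication path is an assumption that would need verifying. For example, the address in step 1 could be changed to something like:

  'benchmark-0;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all,'qpid.max_size':104857600,'qpid.policy_type':ring}}}}'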
Perhaps related is this long-standing RDMA federation problem: Bug 468932.
HA does use federation links, but Bug 468932 says the leak is per link. euroford, does your test involve a lot of broker failures/disconnects? There would be a new link for each reconnect. From the description it sounds like you are just running a sender/receiver against a static cluster with no failures, in which case a per-link leak wouldn't explain the bloat. Even so, there may still be a relationship between the bugs, but that would require either another way of triggering the leak or something else causing a lot of reconnects.
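One way to check whether links are being created repeatedly (a suggestion, assuming the qpid-tools package is installed and the broker is reachable on the default port) is to list the broker's federation links periodically and watch whether the list keeps growing:

  # A steadily growing link list would indicate repeated reconnects
  qpid-route link list qpid-ha-server-ip:5672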
Hi Alan, yes, I just run a sender/receiver against a two-node qpid-ha cluster; the method and configuration files are all posted here. Could you reproduce this bug in an IB environment?
OK, that's what I thought. So on the surface this does not look like it is directly caused by Bug 468932, but there may be some connection. I haven't had a chance to investigate on IB yet; I will update this bug as soon as I do.
Created attachment 963194 [details] valgrind log

I'm sure this bug is RDMA-related.
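For reference, a typical invocation for producing a leak log like this (the exact options used for the attached log are not stated here, and the config path is assumed) would be:

  # Run the broker under valgrind and write a full leak report to a file
  valgrind --leak-check=full --show-reachable=yes --log-file=qpidd-valgrind.log qpidd --config /etc/qpid/qpidd.conf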
My hardware environment is an IBM Flex System x240 with a Mellanox ConnectX-3 56 Gbps dual-port IB card, using libibverbs-rocee-1.1.7-1.1.el6_5 and libmlx4-rocee-1.0.5-1.1.el6_5 from the MRG repo.