Description of problem:
Having a ring queue on a primary broker that just enqueues new messages from many producers, the broker apparently starts to consume more and more memory. The growth is quite slow, and valgrind does not show anything. Sounds similar to bz1136294, but that cause has been ruled out.

Version-Release number of selected component (if applicable):
qpid-cpp-server-ha-0.18-35.el6.x86_64
qpid-cpp-debuginfo-0.18-35.el6.x86_64
qpid-cpp-client-0.18-35.el6.x86_64
qpid-cpp-server-cluster-0.18-35.el6.x86_64
qpid-cpp-server-store-0.18-35.el6.x86_64
qpid-cpp-server-0.18-35.el6.x86_64
(will try on 0.18-40 and on 0.34-5 as well)

How reproducible:
100% within one day

Steps to Reproduce:
(replace "train1" by your hostname)

1.
# cat start_ha_brokers.sh
start_broker() {
	port=$1
	shift
	rm -rf _${port}
	mkdir _${port}
	if [ $port -eq 5675 ]; then
		cmd="valgrind --leak-check=full --show-reachable=yes"
		out="valgrind.txt.$$"
	else
		cmd=nohup
		out=/dev/null
	fi
	$cmd qpidd --port=$port --store-dir=_${port} --log-to-file=qpidd.$port.log --data-dir=_${port} --auth=no --log-to-stderr=no --ha-cluster=yes --ha-brokers-url="train1:5672,train1:5673,train1:5674" --ha-replicate=all "$@" > $out 2>&1 &
	sleep 1
}

killall qpidd qpid-receive 2> /dev/null
rm -f qpidd.*.log
start_broker 5672
sleep 1
qpid-ha promote -b train1:5672
sleep 1
start_broker 5673
sleep 1
start_broker 5674

2.
# cat reproducer.sh
for i in $(seq 0 9); do
	qpid-config add queue FromKeyServer_$i --max-queue-size=10000 --max-queue-count=10 --limit-policy=ring --argument x-qpid-priorities=10 2> /dev/null #--argument qpid.ha-uuid=de7a4a4d-badb-4111-a722-d7b194765d58 > /dev/null 2>&1
	qpid-config bind amq.fanout FromKeyServer_$i
done

while true; do
	for j in $(seq 1 10); do
		for i in $(seq 1 10); do
			for k in $(seq 0 9); do
				qpid-send -a FromKeyServer_$k -m 100 --priority $(($((RANDOM))%10)) --send-rate=50 &
			done
		done
		wait
		while [ $(qpid-stat -q | grep broker-replicator | sed "s/Y//g" | awk '{ print $2 }' | sort -n | tail -n1) != "0" ]; do
			sleep 1
		done
	done
	date
	ps aux | grep qpidd | grep "port=5672" | awk -F "--store-dir" '{ print $1 }'
done

3. Kick off HA brokers and then run the reproducer:
./start_ha_brokers.sh
nohup ./reproducer.sh &

4. Wait one day and check nohup.out for memory usage

Actual results:
4. Memory constantly grows, even though the reproducer sends messages in considerable bursts so that the federation routes are empty after every iteration (see the "while [ $(qpid-stat -q .." loop, which waits until the backup brokers are in sync, with no message left in "their" federation queues on the primary broker).

Expected results:
4. No memory growth.

Additional info:
- testing several variants of the reproducer is ongoing (to find out what the reproducer depends on and what has no effect on it) - we will update details
- checking other qpid-cpp versions is also ongoing
Reproduced on these qpid-cpp versions:
- 0.18-35
- 0.18-39
- 0.34-5 (here the memory increase is *much* more evident)

Reproduced with any value of --ha-replicate=[all|configuration|none].
Asking for MRG 3.2.z as well, since the problem appears there too (and also upstream).
Possible fix on the 0.34-mrg branch. This is definitely a leak and fits the symptoms, but I have not tested Pavel's reproducer. It is a regression introduced by 014f0f3 (QPID-4327: HA TX transactions: basic replication), so it affects 0.32 and 0.30.

commit 3cc32d76de7cfc29686ed9717c150fca9261fb4d
Author: Alan Conway <aconway>
Commit: Alan Conway <aconway>

    Bug 1318180 - HA memory leak in primary broker when overwriting messages in a ring queue

    ReplicatingSubscription accumulates IDs of dequeued messages to send on
    dispatch. It should clear the accumulated IDs once sent. Due to a merge
    error, since:

        014f0f3 QPID-4327: HA TX transactions: basic replication.

    the ID set is not cleared, causing it to accumulate memory slowly. This
    leak would be particularly noticeable on a busy ring-queue, since a
    ring-queue generates a dequeue event for every enqueue once it reaches
    its max size.

    Also fixed a "switch case not handled" compile error when built against
    recent proton-C.
Upstream issue https://issues.apache.org/jira/browse/QPID-7150
(In reply to Alan Conway from comment #4)
> Upstream issue https://issues.apache.org/jira/browse/QPID-7150

I have filed QPID-7149 for that already, with a reproducer modified for upstream. Will close the 7150. Anyway, though I bet r1735439 improves the leak in upstream (something I will test):
- it does not prevent the leak there (the leak is still reproducible with --ha-replicate=none, where the modified code isn't executed)
- the fix can't be backported to 0.18-39, which has:

// Called with lock held. Called in subscription's connection thread.
void ReplicatingSubscription::sendDequeueEvent(Mutex::ScopedLock&)
{
    if (dequeues.empty()) return;
    QPID_LOG(trace, logPrefix << "Sending dequeues " << dequeues);
    string buf(dequeues.encodedSize(),'\0');
    framing::Buffer buffer(&buf[0], buf.size());
    dequeues.encode(buffer);
    dequeues.clear();
    buffer.reset();
    {
        Mutex::ScopedUnlock u(lock);
        sendEvent(QueueReplicator::DEQUEUE_EVENT_KEY, buffer);
    }
}
(In reply to Pavel Moravec from comment #5)
> - the fix can't be backported to 0.18-39 that has:
>
> // Called with lock held. Called in subscription's connection thread.
> void ReplicatingSubscription::sendDequeueEvent(Mutex::ScopedLock&)
> {
>     if (dequeues.empty()) return;
>     QPID_LOG(trace, logPrefix << "Sending dequeues " << dequeues);
>     string buf(dequeues.encodedSize(),'\0');
>     framing::Buffer buffer(&buf[0], buf.size());
>     dequeues.encode(buffer);
>     dequeues.clear();

No need to backport the fix there: it already has the 'dequeues.clear()' call, so it predates the regression. From your observation there is another leak (present even with --ha-replicate=none) that probably affects all versions. The fact that it happens with replication off narrows the search considerably.

https://issues.apache.org/jira/browse/QPID-7149?focusedCommentId=15204363&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15204363
I think this is the real fix, on branch 0.34-mrg-aconway-bz1318180, ready to merge to 0.34-mrg if verified.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-bz1318180&id=bf30340c63f8917023c97b68a3aff694533868d9

Bug 1318180 - [MRG2] fixed memory accumulation in Poller on HA primary

    The primary broker was accumulating PollerHandles of closed connections
    due to a Poller issue that occurs if there is more than one Poller in a
    process.

    The HA primary uses the qpid::messaging client for cluster-internal
    communication. This starts another Poller if AMQP 1.0 is used, which is
    now the default. The Poller has global state (the DeletionManager) that
    accumulates PollerHandleImpls without limit if there is one active and
    one idle Poller in the process.

    The simple workaround is to use the AMQP 0-10 protocol for internal HA
    communication. It uses a different (older) mechanism that does not
    conflict with the broker's Poller.

    Also removed some dead code that was accumulating memory in
    Primary::addReplica.
(In reply to Alan Conway from comment #7)
> I think this is the real fix. on branch 0.34-mrg-aconway-bz1318180 ready to
> merge to 0.34-mrg if verified.
>
> http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-
> bz1318180&id=bf30340c63f8917023c97b68a3aff694533868d9

Trying it on upstream (r1737001), I am afraid the patch does not help (I haven't tried it on 0.34-5, though). See on the machine (credentials in the next private update) where I have:

- unpatched upstream (r1737001) under /data_xfs/qpid/cpp (where I run 3 brokers in a cluster on ports 5672-5674)
- upstream (r1737001) patched by the above fix under /data_ext4/qpid/cpp (where I run 3 brokers in a cluster on ports 6672-6674)

See:

diff -rup /data_xfs/qpid/cpp/src/qpid/ha/ /data_ext4/qpid/cpp/src/qpid/ha

for the patch applied. Running the 2 clusters for comparison:

killall bash qpid-send qpid-receive qpidd
sh ./run_all_to_compare_patch_for_1318180.sh

and monitoring bz1318180_reproducer_against_patched.sh.txt vs. bz1318180_reproducer.sh.txt: both show increased RSS in ps output, i.e. there is still some memory consumption increase over time.

Some further observations:
- Running the original reproducer (10 ring queues fed in parallel), the _backup_ brokers start to consume more and more CPU - both patched and unpatched. That sounds like another issue. The memory leak was present; see /root/bz1318180_reproducer.sh.txt_prev and /root/bz1318180_reproducer_against_patched.sh.txt_prev .
- Running a simplified reproducer with 1 ring queue fed sequentially, _no_ memory leak is seen (though the test was run for a short time).
- Running a reproducer with 1 ring queue fed _in_parallel_, the memory leak is present (this test is currently running there).
I have to apologize: my mistake in parallel building of upstream qpid (patched and unpatched) caused some mess in the qpidd binary. With a proper build, Alan's patch works. No memory leak is left behind when following the scenario.

Just another issue found: on both patched and unpatched qpid, the _backup_ brokers' CPU consumption was growing over time. Since this hasn't been reported by a customer and I am not sure how artificial the scenario is, I filed it as upstream QPID-7182 only.
(In reply to Pavel Moravec from comment #11)
> I have to apologize, my mistake in parallel building of upstream qpid
> (patched and unpatched) caused some mess in qpidd binary. Having proper
> build, the Alan's patch works. No memory leak is left behind when following
> the scenario.
>
> Just another issue found: both on patched and unpatched qpid, _backup_
> brokers CPU consumption was growing over the time. Since this hasnt been
> reported by a customer and not sure how artificial the scenario is, I filed
> it as upstream QPID-7182 only.

Excellent, and I think you have already pinpointed the problem for QPID-7182, so I should have a patch for that shortly...
(In reply to Pavel Moravec from comment #11)
> I have to apologize, my mistake in parallel building of upstream qpid
> (patched and unpatched) caused some mess in qpidd binary. Having proper
> build, the Alan's patch works. No memory leak is left behind when following
> the scenario.

That patch created other problems; I have replaced it with

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-bz1318180-1

This is also committed upstream. It fixes the same problem by a different route, so hopefully it will work as well as the previous one.

> Just another issue found: both on patched and unpatched qpid, _backup_
> brokers CPU consumption was growing over the time. Since this hasnt been
> reported by a customer and not sure how artificial the scenario is, I filed
> it as upstream QPID-7182 only.

I'm looking into that; I also see the slow growth Pavel identified.
Ported to the 0.18-mrg branch:

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.18-mrg-aconway-1318180

Bug 1318180: QPID-7149 - [MRG2] Fix memory accumulation in Poller with HA.

    Combines the following two upstream commits:

    QPID-7149: Active HA broker memory leak

    The leak was caused by the sys::Poller. When two pollers run in the same
    process and one is idle, PollerHandles accumulate on the idle poller's
    threads and are never released.

    The HA broker uses the qpid::messaging API to get initial status from
    other brokers. The messaging API creates a Poller separate from the
    broker's. The fix is to shut down the qpid::messaging poller as soon as
    the initial status checks are complete, so it does not interfere with
    the broker's Poller.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1737852 13f79535-47bb-0310-9956-ffa450edef68

    QPID-7149: Fixed messaging client shutdown.

    Fix for the previous commit, which did not respect the plugin boundaries
    for messaging::shutdown.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2049.html