Description of problem:

On an HA cluster, when having ~32k queues, there are scenarios leading to VERY frequent errors:

2016-10-13 13:05:45 [Protocol] error Execution exception: invalid-argument: session.detach: incorrect session name: qpid.bridge_session_qpid.replicator-qqq+923+qqq.4a06f404-e868-48e0-b595-4c087db6ab94_c0356bcf-180c-43b6-a26b-d71b09e16420, expecting: qpid.bridge_session_qpid.replicator-qqq+301+qqq.7269dd81-9c95-424e-8064-1bff4fdb231d_c0356bcf-180c-43b6-a26b-d71b09e16420 (qpid/amqp_0_10/SessionHandler.cpp:183)

The impact is unknown (there is also potential message loss in some scenarios).

Generic reproducer:
- Have ~32k queues that are replicated to the backup brokers. The replication subscriptions consume almost all 32768 channel IDs available on the AMQP connection (the federation link from backup to primary).
- Have ~100 queues being created and deleted in parallel, so that the remaining free channel IDs are re-used often.
- Put the brokers under some load - in my case the bouncing of 100 queues was enough. The TCP connections to the backup brokers simply cannot deliver all subscription events in time, or the backup brokers cannot consume all of them in time; this causes the federation TCP connections to be dropped occasionally.
- Stop the load (or minimize it, to give the brokers time to recover).
- The primary broker is still sending a huge amount of data to the backups; the TCP connections / underlying sockets have a big Recv-Q or Send-Q (seen in "netstat -anp" output) until the TCP connections are dropped again.

Version-Release number of selected component (if applicable):

qpid-cpp-* 0.18-39

How reproducible:

100% after some time

Steps to Reproduce:
1. Start a 3-node HA cluster.
2. Create 32600 queues replicated to the backups.
3. Run a script to create & delete more queues:

MAXQUEUES=1000
MAXTIMEOUT=5
MAXCONSUMERS=200
i=0
while true; do
  if [ $(pgrep qpid-receive | wc -l) -lt $MAXCONSUMERS ]; then
    if [ $((i % 2)) -eq 0 ]; then
      qParam="{create:always, node:{x-declare:{auto-delete:True, arguments:{'qpid.auto_delete_timeout':1}}}}"
    else
      qParam="{create:always, delete:always}"
    fi
    qpid-receive -a "qqq+$((RANDOM % MAXQUEUES))+qqq; $qParam" \
      -b localhost:5672,localhost:5673,localhost:5674 \
      --timeout=$(( (RANDOM % MAXTIMEOUT) + 1 )) \
      --connection-options "{reconnect:true}" > /dev/null 2>&1 &
    sleep 0.01
  else
    sleep 0.5
  fi
  i=$((i + 1))
  if [ $i -eq 20 ]; then
    pgrep qpid-receive | wc
    i=0
  fi
done

4. After a few minutes, kill the script (ignore its output).
5. Wait a minute and grep the qpid logs for "incorrect session name".

Actual results:

Tons of errors like the one above; other potential impact is unknown at the moment.

Expected results:

No such errors.

Additional info:
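To illustrate where the "incorrect session name" error comes from, here is a minimal hypothetical sketch (not the actual qpid-cpp SessionHandler code; names are illustrative): the bridge remembers the session name last attached on each channel, and when channel IDs are re-used across rapidly created/deleted queues, a detach can arrive carrying a different session name than the one remembered.

// Hypothetical sketch - not the actual qpid-cpp code.
#include <stdint.h>
#include <map>
#include <stdexcept>
#include <string>

// channel id -> session name currently attached on that channel
std::map<uint16_t, std::string> attached;

void onDetach(uint16_t channel, const std::string& name) {
    std::map<uint16_t, std::string>::iterator i = attached.find(channel);
    // If the channel was re-used for a new queue while a stale entry from
    // the old bridge remains, the names disagree and the broker logs the
    // "session.detach: incorrect session name: X, expecting: Y" error.
    if (i != attached.end() && i->second != name)
        throw std::invalid_argument(
            "session.detach: incorrect session name: " + name +
            ", expecting: " + i->second);
    attached.erase(channel);
}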
So far identified patches needed to backport:
1) The practical limit of 32k queues (with no check when the limit is reached) - fixed by [1], present in 0.30.
2) The old bridge keeps its session map even for a deleted connection; a new connection must rebuild that map (see the sketch below) - fixed by [2], present in 0.34.
3) A regression introduced by [2] - fixed by [3].

[1] https://github.com/apache/qpid-cpp/commit/573cff7d2b9095871de5a6dd1c987f31c4fb8d01
[2] https://github.com/apache/qpid-cpp/commit/f44aed6258cc9b2183561329e5255db8bb63800e
[3] https://github.com/apache/qpid-cpp/commit/27f7eec9a49bd0185c4fe7a8a6cfdc2ab9143271
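A minimal sketch of what fix [2] amounts to, continuing the hypothetical map from the sketch above (again illustrative, not the actual Bridge code): the per-connection session map must not outlive the connection it was built for.

// Illustrative sketch of fix [2]: drop the stale map when the federation
// connection is deleted, so a replacement connection starts from scratch
// and re-registers each bridge session as it re-attaches.
#include <stdint.h>
#include <map>
#include <string>

std::map<uint16_t, std::string> attached; // channel id -> session name

void onConnectionDeleted() { attached.clear(); }

void onAttach(uint16_t channel, const std::string& name) {
    attached[channel] = name; // rebuild the entry on the new connection
}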
Comment for QE, if verification is required: one of the fixes moves the threshold to hit the bug (should the bug not be properly fixed) from 32k queues to 64k queues. It is therefore worth trying the original scenario and also a variant with the step: ./create_many_queues.sh 1 65368
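The create_many_queues.sh script itself is not attached to this report; for reference, the following is a minimal C++ sketch of an equivalent bulk-creation loop, assuming the standard qpid::messaging client API and queue replication enabled via the qpid.replicate argument (queue naming is illustrative).

// Hypothetical equivalent of "create_many_queues.sh START COUNT":
// creates COUNT replicated queues named qN starting at index START.
#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Session.h>
#include <qpid/messaging/Sender.h>
#include <cstdlib>
#include <sstream>

int main(int argc, char** argv) {
    int start = argc > 1 ? std::atoi(argv[1]) : 1;
    int count = argc > 2 ? std::atoi(argv[2]) : 1;
    qpid::messaging::Connection connection("localhost:5672");
    connection.open();
    qpid::messaging::Session session = connection.createSession();
    for (int i = start; i < start + count; ++i) {
        std::ostringstream addr;
        // qpid.replicate:all marks the queue for replication to backups
        addr << "q" << i
             << "; {create: always, node: {x-declare: "
                "{arguments: {'qpid.replicate': all}}}}";
        session.createSender(addr.str()).close();
    }
    connection.close();
    return 0;
}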
Fixed in a single rolled-up commit on 0.18-mrg-MCC-BZ1387541. The 3 commits involved all stepped on each other, so it was clearer that way. There was some refactoring to work around, but the final changes are quite small.

874c7541912ed1b963182b67a02174812555eb74
Author:     Alan Conway <aconway>
AuthorDate: Thu Jan 9 22:12:25 2014 +0000
Commit:     Alan Conway <aconway>
CommitDate: Wed Jan 18 18:49:20 2017 -0500
Parent:     18b4324 Bug 1387541,QPID-5454: Sporadic core dump in ha_tests.ReplicationTests.test_auto_delete_failover
Merged:     0.18-mrg-MCC-BZ1387541
Containing: 0.18-mrg-MCC-BZ1387541
Follows:    mrg-2.5.17-ga (5)

Bug 1387541 - session mismatch on replication links when having 32k queues

This is a combined backport of commits for the following issues:

QPID-5513: HA backup fails if number of replicated queues exceeds number of channels.

The problem:
- Create a cluster of 2 brokers.
- Create more than 32768 queues (exceeds the number of channels on a connection).
- The backup exits with a critical error, but the client creating queues receives no error, and the primary continues with an unreplicated queue.

The solution: The primary raises an error to the client if it attempts to create queues in excess of the channel limit. The queue is not created on primary or backup; primary and backup continue as normal.

In addition: raised the channel limit from 32k to 64k. There was no reason for the smaller limit. See discussion: http://qpid.2158936.n2.nabble.com/CHANNEL-MAX-and-CHANNEL-HIGH-BIT-question-tp7603121p7603138.html

Added a new unit test to reproduce the issue; it must create > 64k queues.

Other minor improvements:
- The brokertest framework doesn't override --log options in the arguments.
- Increased the default heartbeat in the test framework for tests that have busy brokers.

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk/qpid@1561206 13f79535-47bb-0310-9956-ffa450edef68

Bug 1387541,QPID-5513: Fix windows link error from previous commit.

Fix link error (incorrect libraries listed) in ha_test_max_queues on windows.

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1561754 13f79535-47bb-0310-9956-ffa450edef68

Bug 1387541,QPID-5666: HA fails with resource-limit-exceeded: Exceeded replicated queue limit

This is a regression introduced in r1561206 (CommitDate: Fri Jan 24 21:54:59 2014 +0000, "QPID-5513: HA backup fails if number of replicated queues exceeds number of channels"), fixed by the current commit. PrimaryQueueLimits was not taking account of queues already on the broker prior to promotion.

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1585507 13f79535-47bb-0310-9956-ffa450edef68
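To illustrate the solution described above, here is a simplified sketch of the limit enforcement. The real implementation is the PrimaryQueueLimits class in the qpid-cpp HA module; the class name, members, and error type below are illustrative only.

// Simplified sketch of the QPID-5513 behaviour - not the real class.
#include <stdint.h>
#include <stdexcept>
#include <string>

class QueueLimits {
    uint32_t used;         // replicated queues currently holding a channel
    const uint32_t limit;  // channels available on the replication link
  public:
    explicit QueueLimits(uint32_t l) : used(0), limit(l) {}

    // Called by the primary when a client declares a replicated queue.
    // The error propagates to the client; the queue is created on neither
    // primary nor backup, and both brokers continue as normal.
    void addQueue(const std::string& name) {
        if (used >= limit)
            throw std::runtime_error(
                "resource-limit-exceeded: Exceeded replicated queue limit, "
                "cannot create " + name);
        ++used;
    }
    void removeQueue() { if (used) --used; }
};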
I reduced the internal limit to 61534 (hard limit - 4000). I'm able to run your test (starting from 61400 queues) for an hour with no backup failing.

The hard limit is the 64k channel limit on the AMQP connection. The "channel pool" error occurs because there is a delay between deleting a queue and closing its channel, so we temporarily have more channels than queues when queues are created & deleted rapidly. By leaving 4000 "spare" channels we should be safe in any realistic situation.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.18-mrg-MCC-BZ1387541&id=a637bf879029b13e361853a987758ac069a2a968

commit a637bf879029b13e361853a987758ac069a2a968
Author: Alan Conway <aconway>
Commit: Alan Conway <aconway>

    Bug 1387541 - lower limit on queue replication

    Queue replication is limited by the 64k channel limit on the primary/backup
    connection. Previous fixes ensure the primary never creates more than 64k
    queues. However, there is a delay between deleting a queue and closing its
    channel, so when queues are created/deleted very rapidly it is possible to
    temporarily have more channels than queues, which can overrun the limit.

    Reduced the limit to reserve 4000 "spare" channels. Testing suggests this is
    enough to prevent overrun in any likely scenario.
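The arithmetic behind the new limit, with illustrative constant names (not the actual qpid-cpp identifiers; the 65534 hard limit follows from the "61534 = hard limit - 4000" figure above):

// Illustrative constants only.
#include <stdint.h>
const uint32_t HARD_CHANNEL_LIMIT = 65534; // per-connection AMQP 0-10 channel limit
const uint32_t SPARE_CHANNELS     = 4000;  // head-room for the queue-delete/channel-close lag
const uint32_t QUEUE_LIMIT = HARD_CHANNEL_LIMIT - SPARE_CHANNELS; // = 61534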
(In reply to Alan Conway from comment #6)
> I reduced the internal limit to 61534 (hard limit - 4000). I'm able to run
> your test (starting from 61400 queues) for an hour with no backup failing.

With the relevant build, I can no longer reproduce the problematic behaviour.