Bug 1387541

Summary: session mismatch on replication links when having 32k queues
Product: Red Hat Enterprise MRG Reporter: Pavel Moravec <pmoravec>
Component: qpid-cppAssignee: messaging-bugs <messaging-bugs>
Status: POST --- QA Contact: Messaging QE <messaging-qe-bugs>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 2.5CC: bvassova, jross
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Pavel Moravec 2016-10-21 08:40:39 UTC
Description of problem:
On HA cluster, when having 32k queues, there are scenarios leading to VERY frequent errors:

2016-10-13 13:05:45 [Protocol] error Execution exception: invalid-argument: session.detach: incorrect session name: qpid.bridge_session_qpid.replicator-qqq+923+qqq.4a06f404-e868-48e0-b595-4c087db6ab94_c0356bcf-180c-43b6-a26b-d71b09e16420, expecting: qpid.bridge_session_qpid.replicator-qqq+301+qqq.7269dd81-9c95-424e-8064-1bff4fdb231d_c0356bcf-180c-43b6-a26b-d71b09e16420 (qpid/amqp_0_10/SessionHandler.cpp:183)

with unknown impact (also potential message loss in some scenario). Generic reproducer:

- have ~32k queues that are replicated to backup brokers. These replication subscriptions consume almost all 32768 channel IDs available on the AMQP connection (for the federation link from backup to primary)
- have ~100 queues being created and deleted in parallel, such that remaining free channel IDs are being re-used often
- put the brokers under some load - in my case the bouncing of 100 queues was enough; simply the TCP connections to backup brokers wont manage to deliver all subscription news/events / backup brokers wont be able to consume all of them on time
- this causes TCP connections for the federations being sometimes dropped
- stop the load (or minimize it, to let brokers time to recover)
- primary broker still sending huge amount of data to backups, the TCP connections / underlying sockets have big Recv-Q or Send-Q (seen in netstat -anp output) until the TCP connections are dropped again


Version-Release number of selected component (if applicable):
qpid-cpp-* 0.18-39


How reproducible:
100% after some time


Steps to Reproduce:
1. start 3nodes HA cluster
2. create 32600 queues being replicated to backups
3. run script to create&delete more queues:

MAXQUEUES=1000
MAXTIMEOUT=5
MAXCOMSUMERS=200

i=0
while true; do
  if [ $(pgrep qpid-receive | wc -l) -lt $MAXCOMSUMERS ]; then
    if [ $(($((i))%2)) -eq 0 ]; then
      qParam="{create:always, node:{x-declare:{auto-delete:True, arguments:{'qpid.auto_delete_timeout':1}}}}"
    else
      qParam="{create:always, delete:always}"
    fi
    qpid-receive -a "qqq+$(($((RANDOM))%$((MAXQUEUES))))+qqq; $qParam" -b localhost:5672,localhost:5673,localhost:5674 \
      --timeout=$(($(($((RANDOM))%$((MAXTIMEOUT))))+1)) --connection-options "{reconnect:true}" > /dev/null 2>&1 &
    sleep 0.01
  else
    sleep 0.5
  fi
  i=$(($((i))+1))
  if [ $i -eq 20 ]; then
    pgrep qpid-receive | wc
    i=0
  fi
done

4. After few minutes, kill the latest script (ignore it's output).
5. Wait a minute and grep for "incorrect session name" in qpid logs.


Actual results:
tons of errors like above, potential other impact unknown ATM


Expected results:
no such errors


Additional info:

Comment 2 Pavel Moravec 2016-10-21 08:46:49 UTC
so far identified patches needed to backport:

1) Just 32k queue practical limit (and no check if the limit is reached) - fixed by [1], present in 0.30
2) Old bridge keeps track of session map even for a deleted connection - new connection must rebuild such map - fixed by [2], present in 0.34
3) regression introduced by [2] - fixed in [3]

[1] https://github.com/apache/qpid-cpp/commit/573cff7d2b9095871de5a6dd1c987f31c4fb8d01
[2] https://github.com/apache/qpid-cpp/commit/f44aed6258cc9b2183561329e5255db8bb63800e
[3] https://github.com/apache/qpid-cpp/commit/27f7eec9a49bd0185c4fe7a8a6cfdc2ab9143271

Comment 3 Pavel Moravec 2016-10-21 08:49:57 UTC
comment for QE if verification will be required:

due to some fix, the threshold to hit the bug (if the bug wont be properly fixed) is moved by the bugfix from 32k queues to 64k queues. So worth trying original scenario and also with step:

./create_many_queues.sh 1 65368

Comment 4 Alan Conway 2017-01-18 23:57:25 UTC
Fixed in a single rolled up commit on 0.18-mrg-MCC-BZ1387541. The 3 commits involved all stepped on each other so it was clearer that way. There was some refactoring to work around but the final changes are quite small.

874c7541912ed1b963182b67a02174812555eb74

Author:     Alan Conway <aconway>
AuthorDate: Thu Jan 9 22:12:25 2014 +0000
Commit:     Alan Conway <aconway>
CommitDate: Wed Jan 18 18:49:20 2017 -0500

Parent:     18b4324 Bug 1387541,QPID-5454: Sporadic core dump in ha_tests.ReplicationTests.test_auto_delete_failover
Merged:     0.18-mrg-MCC-BZ1387541
Containing: 0.18-mrg-MCC-BZ1387541
Follows:    mrg-2.5.17-ga (5)

Bug 1387541 - session mismatch on replication links when having 32k queues

This is a combined backport of commits for the following issues:

QPID-5513: HA backup fails if number of replicated queues exceeds number of channels.

The problem:
- create cluster of 2 brokers.
- create more than 32768 queues (exceeds number of channels on a connection)
- backup exits with critical error but
- client creating queues receives no error, primary continues with unreplicated queue.

The solution: Primary raises an error to the client if it attempts to create
queues in excess of the channel limit. The queue is not created on primary
or backup, primary and backup continue as normal.

In addition: raised the channel limit from 32k to 64k. There was no reason for
the smaller limit. See discussion: http://qpid.2158936.n2.nabble.com/CHANNEL-MAX-and-CHANNEL-HIGH-BIT-question-tp7603121p7603138.html

New unit test to reproduce the issue, must create > 64k queues.

Other minor improvements:
- brokertest framework doesn't override --log options in the arguments.
- increased default heartbeat in test framework for tests that have busy brokers.

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk/qpid@1561206 13f79535-47bb-0310-9956-ffa450edef68

Bug 1387541,QPID-5513: Fix windows link error from previous commit.

Fix link error (incorrect libraries listed) in ha_test_max_queues on windows.

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1561754 13f79535-47bb-0310-9956-ffa450edef68

Bug 1387541,QPID-5666: HA fails with resource-limit-exceeded: Exceeded replicated queue limit

This is regression introduced in r1561206: CommitDate: Fri Jan 24 21:54:59 2014 +0000
  QPID-5513: HA backup fails if number of replicated queues exceeds number of channels.

Fixed by the current commit. PrimaryQueueLimits was not taking account of queues already
on the broker prior to promotion.

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1585507 13f79535-47bb-0310-9956-ffa450edef68

Comment 6 Alan Conway 2017-01-31 02:25:20 UTC
I reduced the internal limit to 61534 (hard limit - 4000) I'm able to run your test (starting from 61400 queues) for an hour with no backup failing.

The hard limit is the 64k channel limit on AMQP connection. The "channel pool" error is because there is a delay between deleting the queue and closing the channel - so we temporarily have more channels than queues when queues are created & deleted rapidly. By leaving the 4000 "spare" channels we should be safe in any realistic situation.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.18-mrg-MCC-BZ1387541&id=a637bf879029b13e361853a987758ac069a2a968

commit a637bf879029b13e361853a987758ac069a2a968
Author: Alan Conway <aconway>
Commit: Alan Conway <aconway>

    Bug 1387541 - lower limit on queue replication
    
    Queue replication is limited by the 64k channel limit on the primary/backup
    connection. Previous fixes ensure the primary never creates more than 64k
    queues. However there is a delay between deleting a queue and closing its
    channel, so when queues are created/deleted very rapidly its possible to
    temporarily have more channels than queues which can overrun the limit.
    
    Reduced the limit to reserve 4000 "spare" channels. Testing suggests this is
    enough to prevent overrun in any likely scenario.

Comment 7 Pavel Moravec 2017-02-01 14:17:56 UTC
(In reply to Alan Conway from comment #6)
> I reduced the internal limit to 61534 (hard limit - 4000) I'm able to run
> your test (starting from 61400 queues) for an hour with no backup failing.
> 
> The hard limit is the 64k channel limit on AMQP connection. The "channel
> pool" error is because there is a delay between deleting the queue and
> closing the channel - so we temporarily have more channels than queues when
> queues are created & deleted rapidly. By leaving the 4000 "spare" channels
> we should be safe in any realistic situation.
> 
> http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.18-mrg-MCC-
> BZ1387541&id=a637bf879029b13e361853a987758ac069a2a968
> 
> commit a637bf879029b13e361853a987758ac069a2a968
> Author: Alan Conway <aconway>
> Commit: Alan Conway <aconway>
> 
>     Bug 1387541 - lower limit on queue replication
>     
>     Queue replication is limited by the 64k channel limit on the
> primary/backup
>     connection. Previous fixes ensure the primary never creates more than 64k
>     queues. However there is a delay between deleting a queue and closing its
>     channel, so when queues are created/deleted very rapidly its possible to
>     temporarily have more channels than queues which can overrun the limit.
>     
>     Reduced the limit to reserve 4000 "spare" channels. Testing suggests
> this is
>     enough to prevent overrun in any likely scenario.

In the relevant build, I cant reproduce the problematic behaviour further.