Description of problem:
When publishing via an exchange to e.g. 200 queues, the broker becomes almost non-responsive for the duration of the publishing. After publishing ends and consumers drain the queues, responsiveness improves (but remains far from normal) until the queues are empty. From a black-box perspective, there is no reason for the broker to be so unresponsive (to the point that e.g. the initial connection handshake, governed by max-negotiate-time, times out).

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.18-14.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Use 2 machines, one for the clients and one for the server.
2. Run the attached reproducer: ./00893742_start_consumers_producers-loop.sh 200
3. Meanwhile, run in a loop any tool that monitors PerfQueue1 depth over time (and monitor the duration of each such execution). E.g., I am attaching a simple QMF program to get queue statistics. When using it, try:

while true; do echo "---------------------------------"; date; time ./get_queue_details PerfQueue1 | cut -d, -f34; done

(ensure that "msgDepth:0" is printed; otherwise adjust "-f34" accordingly)

Actual results:
The execution of get_queue_details (or any other client) takes anywhere from the default 1 second (fetch timeout) up to e.g. 15 seconds, despite consisting of sending and receiving a single message. The highest durations are seen while publishing to the queues, lower ones while only consuming from them. Once the queues are empty again, the duration is back to 1s.

tcpdumps reveal that the problematic delays come from establishing the TCP+AMQP connection+session, _not_ from the broker responding to the QMF request too late.

Expected results:
No such huge difference in new-connection attempt duration over the course of the test.

Additional info:
Some further analysis will be provided later on.
Created attachment 765096 [details]
reproducer script

It uses qpid-send / qpid-receive. Replace the "broker" hostname with your qpid broker's hostname.
Created attachment 765097 [details]
simple C++ program querying the broker for queue stats via QMF
"Messaging performance" (message throughput) is not affected by this issue. Increasing number of consumers in the test, message throughpout is quite similar everytime: #consumers producersTime consumersTime #msgs Producers #msgs/s %age of first row Consumers #msgs/s %age of first row prods time/all msgs %age of first row cons time/all msgs %age of first row 10 2.392 10.337 500000 20903 100.00% 48370 100.00% 229933 100.00% 53207 100.00% 20 3.624 17.655 1000000 13797 66.00% 56641 117.10% 289735 126.01% 59473 111.78% 50 6.043 41.034 2500000 8274 39.58% 60925 125.96% 421976 183.52% 62144 116.80% 100 10.915 81.903 5000000 4581 21.91% 61048 126.21% 462666 201.22% 61658 115.88% 150 18.138 117.297 7500000 2757 13.19% 63940 132.19% 416253 181.03% 64367 120.97% 200 24.629 151.311 10000000 2030 9.71% 66089 136.63% 408056 177.47% 66419 124.83% 250 39.319 180.228 12500000 1272 6.08% 69357 143.39% 319184 138.82% 69634 130.87% 300 35.769 232.393 15000000 1398 6.69% 64546 133.44% 420755 182.99% 64761 121.72%
Latency / delay in establishing new connections is affected the most. BUT it can be improved by increasing the number of worker threads. For 10 producers, 15 or 20 worker threads improve the latency by ~25%:

test             conn. attempt after [s]  AVG new con. delay  MEDIAN new con. delay  producersTime  consumersTime  relative AVG new con.  relative MED new con.
default          0                        7.133               7.059                  12.874         78.321         100.00%                100.00%
threads=10       0                        7.653               6.129                  18.363         78.390         107.28%                86.83%
threads=15       0                        5.144               4.419                  21.672         76.661         72.11%                 62.60%
threads=20       0                        5.620               5.621                  23.508         77.985         78.79%                 79.63%
tcp-nodelay      0                        12.457              12.158                 17.573         78.623         174.64%                172.23%
connBacklog=100  0                        10.572              10.864                 14.787         79.186         148.21%                153.90%
m=no             0                        12.945              13.825                 16.600         79.224         181.47%                195.85%
default          1                        10.033              9.360                  15.557         76.607         100.00%                100.00%
threads=10       1                        10.667              10.573                 16.327         75.135         106.32%                112.96%
threads=15       1                        7.580               6.978                  23.199         77.159         75.55%                 74.55%
threads=20       1                        6.801               7.422                  25.127         76.761         67.79%                 79.29%
tcp-nodelay      1                        10.863              8.808                  14.203         79.282         108.28%                94.10%
connBacklog=100  1                        10.349              8.797                  15.660         78.857         103.15%                93.99%
m=no             1                        11.396              10.459                 13.179         76.595         113.59%                111.74%
default          2                        10.806              8.683                  12.848         77.105         100.00%                100.00%
threads=10       2                        11.074              10.736                 17.847         76.491         102.48%                123.64%
threads=15       2                        8.049               7.919                  18.894         77.746         74.48%                 91.20%
threads=20       2                        6.504               6.934                  24.330         76.378         60.19%                 79.86%
tcp-nodelay      2                        12.284              11.699                 14.582         79.643         113.68%                134.73%
connBacklog=100  2                        9.235               7.483                  12.990         78.632         85.46%                 86.18%
m=no             2                        13.136              10.060                 13.546         77.401         121.56%                115.86%
Andrew, do we have a good understanding of why this happens? My suspicion is yes, so a follow-up: do we have any leads on how to improve it?
Yet another round of tests was performed, with the consumers switched off and only the producers running. Latencies in new connection establishment were much, much lower, especially when increasing worker-threads. That leads me to this theory:

1) a worker thread takes a message from a TCP socket and routes it to the exchange
2) the message matches 200 queues, so for each individual queue some locking / synchronization of this thread with the queue and/or subscription is done - that takes some time
3) doing so for more and more queues, the time required to process one incoming message on the thread increases
4) above some threshold (50 queues or so), all worker threads are fully occupied by this activity, such that they don't service the other TCP sockets on a "fair enough" basis
5) new connection requests are delayed, causing new connections to time out (or take ages)
6) processing of received heartbeats is delayed, causing disconnections due to false-positive timeouts

Am I right?
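To make the theory concrete, here is a minimal C++ sketch of the suspected path in steps 1)-4) (all names are hypothetical, not the actual broker sources): routing one message to 200 bound queues means 200 lock acquisitions on a single worker thread, and every publishing worker contends for the same locks.

#include <deque>
#include <memory>
#include <mutex>
#include <vector>

struct Message {};

// Hypothetical stand-in for a broker queue: every enqueue takes the
// queue's lock, which all publishing worker threads contend for.
struct Queue {
    std::mutex lock;
    std::deque<std::shared_ptr<Message>> messages;

    void deliver(const std::shared_ptr<Message>& m) {
        std::lock_guard<std::mutex> guard(lock);  // step 2: per-queue locking
        messages.push_back(m);                    // (plus subscription handling etc.)
    }
};

// Steps 1)-3): the cost of routing one message grows linearly with the
// number of bound queues, so with 200 queues a worker thread performs
// 200 lock acquisitions per message and may block behind other workers
// routing to the same queues.
void route(const std::shared_ptr<Message>& m,
           const std::vector<std::shared_ptr<Queue>>& boundQueues) {
    for (const auto& q : boundQueues)
        q->deliver(m);
}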
How is the symptom affected if you split the producers and consumers across several distinct exchanges? I.e. n consumers, m producers on 1 exchange versus n/e consumers, m/e producers each for e exchanges of the same type.
(In reply to Gordon Sim from comment #7)
> How is the symptom affected if you split the producers and consumers
> across several distinct exchanges? I.e. n consumers, m producers on 1
> exchange versus n/e consumers, m/e producers each for e exchanges of the
> same type.

Good idea; the symptom is visible mainly for >100 consumers per exchange. In particular, for 300 consumers and 12 producers, the times are:

#exchanges  AVG new con. delay  MEDIAN new con. delay  producersTime
1           23.196              23.796                 21.589
2           14.660              14.550                 18.217
3           15.393              15.727                 27.195
4           10.598              10.611                 38.289
6           8.171               8.150                  33.083
12          7.151               6.493                  21.764

Basically there are 2 thresholds where noticeably worse new-connection latency is seen:
- around 50 queues per exchange (6 exchanges for 300 consumers)
- 100-150 queues per exchange (3-4 exchanges for 300 consumers)
(In reply to Pavel Moravec from comment #8)
> (In reply to Gordon Sim from comment #7)
> > How is the symptom affected if you split the producers and consumers
> > across several distinct exchanges? I.e. n consumers, m producers on 1
> > exchange versus n/e consumers, m/e producers each for e exchanges of
> > the same type.
>
> Good idea; the symptom is visible mainly for >100 consumers per exchange.
> In particular, for 300 consumers and 12 producers, the times are:
>
> #exchanges  AVG new con. delay  MEDIAN new con. delay  producersTime
> 1           23.196              23.796                 21.589
> 2           14.660              14.550                 18.217
> 3           15.393              15.727                 27.195
> 4           10.598              10.611                 38.289
> 6           8.171               8.150                  33.083
> 12          7.151               6.493                  21.764
>
> Basically there are 2 thresholds where noticeably worse new-connection
> latency is seen:
> - around 50 queues per exchange (6 exchanges for 300 consumers)
> - 100-150 queues per exchange (3-4 exchanges for 300 consumers)

That holds when the producers send the same number of messages overall, while the consumers receive fewer messages - only from the producers sending to "their" exchange. When each producer sends messages to each exchange (such that each consumer gets messages from all producers), the results are different:

#exchanges (same received msgs)  AVG new con. delay  MEDIAN new con. delay  producersTime
1                                25.590              25.146                 31.703
2                                25.686              25.630                 27.571
3                                31.195              30.230                 33.651
4                                31.529              29.111                 32.710
6                                32.689              32.926                 36.273
12                               37.641              36.862                 58.448

I.e. there is no valid workaround in having e.g. 2 exchanges with half of the consumers bound to each and the producers sending to both exchanges twice.
If I've understood the setup correctly, there are multiple producers simultaneously sending to the same queues (I think the number of exchanges isn't relevant, because the exchanges are stateless from the point of view of processing messages and so don't need locking in that path).

If so, then this seems to me to be caused by contention over the queue locks in the enqueue path, so holding the locks for a shorter time (or eliminating them altogether) will help.
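A minimal sketch of the "hold the locks for a shorter time" idea (hypothetical code, not the broker's actual Queue implementation): keep only the shared-state mutation inside the critical section and move everything else, e.g. waking a consumer, outside it.

#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

struct Message {};

class Queue {
    std::mutex lock;
    std::condition_variable available;
    std::deque<std::shared_ptr<Message>> messages;
public:
    void enqueue(std::shared_ptr<Message> m) {
        {
            // The critical section covers only the actual mutation...
            std::lock_guard<std::mutex> guard(lock);
            messages.push_back(std::move(m));
        }
        // ...while the consumer wake-up happens after the lock is
        // released, shortening the window in which publishers collide.
        available.notify_one();
    }
};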
(In reply to Andrew Stitcher from comment #10)
> If I've understood the setup correctly, there are multiple producers
> simultaneously sending to the same queues (I think the number of exchanges
> isn't relevant, because the exchanges are stateless from the point of view
> of processing messages and so don't need locking in that path).
>
> If so, then this seems to me to be caused by contention over the queue
> locks in the enqueue path, so holding the locks for a shorter time (or
> eliminating them altogether) will help.

While undoubtedly eliminating or reducing locking will help, I wonder if there is more to this.

How would this situation compare to one where a single queue was shared by all consumers? There would be even more contention there. Would there be similar starvation of new connections?

How much work can a connection do on a worker thread before returning to the poller? Is that controlled by the volume of data? The number of I/O operations?

(Also, nit-pick: exchanges aren't entirely state-free from the p.o.v. of processing messages, as they have to match against bindings, which requires some locking at least. The scope of that locking is much smaller than for queues, of course.)
Andrew (or anybody else), thinking about a workaround:

Won't worker-threads=50 or even 100 harm a system with 12 cores?

/me just trying to test it..
(In reply to Gordon Sim from comment #11)
> (Also, nit-pick: exchanges aren't entirely state-free from the p.o.v. of
> processing messages, as they have to match against bindings, which
> requires some locking at least. The scope of that locking is much smaller
> than for queues, of course.)

At least some of the exchanges use a pointer-following scheme that avoids locking on this path completely: they atomically follow a pointer to the bindings, which atomically switch from an older state to a newer one without blocking the message-routing path. I think the topic and xml exchanges can't do this, as their bindings are more complex.
(In reply to Gordon Sim from comment #11)
> How would this situation compare to one where a single queue was shared by
> all consumers? There would be even more contention there. Would there be
> similar starvation of new connections?

This would certainly be an interesting comparison point, to compare lock impact/contention from different locations in the broker.

> How much work can a connection do on a worker thread before returning to
> the poller? Is that controlled by the volume of data? The number of I/O
> operations?

The amount of work that can happen on a single connection before the worker goes back to find another connection is limited to 2ms (or until the connection runs out of data to read or blocks for writing). This 2ms limit includes all processing by all code in the broker (exchange/queue/etc.) and is imposed by qpid/sys/posix/AsynchIO.cpp in AsynchIO::readbuff() & AsynchIO::writebuff(). Using the stap probes in that module is a good way to see what the actual waiting-time distribution is.

It would be possible to change that limit if it turns out that every thread is spending the full 2ms.
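For illustration, a simplified C++ sketch of the 2ms budget described above (the callback names are hypothetical; this is not an excerpt from AsynchIO.cpp): a worker keeps servicing one connection until it runs out of readable data or exhausts its time slice, then returns to the poller so other connections, including brand-new ones, get a turn.

#include <chrono>
#include <functional>

void serviceConnection(const std::function<bool()>& hasMoreData,      // hypothetical callbacks
                       const std::function<void()>& processOneBuffer) {
    using clock = std::chrono::steady_clock;
    const auto threadMaxIoTime = std::chrono::milliseconds(2);        // the 2ms budget
    const auto start = clock::now();

    while (hasMoreData()) {
        processOneBuffer();  // includes all exchange/queue processing, so time
                             // spent waiting on queue locks counts against the budget
        if (clock::now() - start >= threadMaxIoTime)
            break;           // hand the thread back to the poller
    }
}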
(In reply to Pavel Moravec from comment #12)
> Andrew (or anybody else), thinking about a workaround:
>
> Won't worker-threads=50 or even 100 harm a system with 12 cores?
>
> /me just trying to test it..

This is a very interesting question:

The worker threading design, with a fixed number of worker threads (roughly the same as the number of cores), assumes that the threads are always busy and hardly ever get held up by locks.

In that case you can't really get better performance by adding threads, because you just end up adding context-switching overhead to service more threads than can actually run at once.

However, if substantial time is spent waiting for locks, then maybe we can improve the throughput by adding extra threads that can do work while other threads are blocked on locks. However, it is also possible that adding extra threads will proportionally worsen the lock contention too.

The direction we have pursued in the past has been to try to eliminate lock contention, and this can only be a good thing in any event.

In the end, running some experiments with the load that causes the problem will give us some useful data points.
(In reply to Andrew Stitcher from comment #13)
> (In reply to Gordon Sim from comment #11)
>
> > (Also, nit-pick: exchanges aren't entirely state-free from the p.o.v. of
> > processing messages, as they have to match against bindings, which
> > requires some locking at least. The scope of that locking is much
> > smaller than for queues, of course.)
>
> At least some of the exchanges use a pointer-following scheme that avoids
> locking on this path completely.

*All* the exchanges take a lock at some point during the routing of messages. The best case is the FanoutExchange, which simply takes the lock in CopyOnWriteArray::snapshot() (the headers exchange is somewhat similar, just with more complex matching after taking the snapshot). The DirectExchange takes that lock as well, plus another lock protecting the map it holds keyed on binding key. The Topic- and Xml- exchanges use read locks.

I am of course merely being pedantic here, as the reproducer uses the FanoutExchange, where the locking is minimal, and I am certainly not suggesting it's relevant.
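For reference, a minimal sketch of the copy-on-write snapshot pattern mentioned above (a simplification, not the real CopyOnWriteArray in the qpid sources): a reader holds the lock only long enough to copy a shared pointer, then iterates an immutable snapshot lock-free, while writers replace the whole array.

#include <memory>
#include <mutex>
#include <vector>

template <typename T>
class CopyOnWriteArray {
    std::mutex lock;
    std::shared_ptr<const std::vector<T>> array;
public:
    CopyOnWriteArray() : array(std::make_shared<std::vector<T>>()) {}

    void add(const T& t) {
        std::lock_guard<std::mutex> guard(lock);
        auto copy = std::make_shared<std::vector<T>>(*array);  // copy on write
        copy->push_back(t);
        array = std::move(copy);                               // swap under the lock
    }

    // The only locking on the routing path: copy the pointer and return.
    std::shared_ptr<const std::vector<T>> snapshot() {
        std::lock_guard<std::mutex> guard(lock);
        return array;
    }
};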
> *All* the exchanges take a lock at some point during the routing of > messages. I stand corrected, apologies for the misinformation. (I'm still wondering why I thought the exchanges worked without locking though)
(In reply to Andrew Stitcher from comment #15)
> (In reply to Pavel Moravec from comment #12)
> > Andrew (or anybody else), thinking about a workaround:
> >
> > Won't worker-threads=50 or even 100 harm a system with 12 cores?
> >
> > /me just trying to test it..
>
> This is a very interesting question:
>
> The worker threading design, with a fixed number of worker threads
> (roughly the same as the number of cores), assumes that the threads are
> always busy and hardly ever get held up by locks.
>
> In that case you can't really get better performance by adding threads,
> because you just end up adding context-switching overhead to service more
> threads than can actually run at once.

I finished the tests just before I left for PTO. Having many worker threads _does_ improve the latency, a lot. I.e., a test suite with 75 parallel producers shows that worker-threads set to #producers or more starts to provide significantly better latencies (see the 1st and last columns):

75 producers, 200 consumers, 3333 msgs/producer, content size 10:

#worker-threads  AVG new con. delay  MEDIAN new con. delay  producersTime  relative AVG new con.  relative MED new con.
5                40.029              39.899                 56.612         100.00%                100.00%
10               18.617              17.843                 30.806         46.51%                 44.72%
13               14.498              14.391                 27.823         36.22%                 36.07%
25               14.420              12.893                 27.177         36.02%                 32.31%
50               10.623              9.557                  27.474         26.54%                 23.95%
55               10.006              8.684                  27.546         25.00%                 21.76%
60               9.816               9.254                  28.300         24.52%                 23.19%
65               9.992               9.516                  28.609         24.96%                 23.85%
70               9.6004              8.902                  28.514         23.98%                 22.31%
75               7.953               7.886                  29.169         19.87%                 19.76%
80               6.5586              6.763                  29.874         16.38%                 16.95%
85               4.9234              4.752                  30.207         12.30%                 11.91%
90               3.4268              3.363                  30.11          8.56%                  8.43%
95               2.0554              2.167                  30.251         5.13%                  5.43%
100              1.1928              1.186                  30.667         2.98%                  2.97%
125              1.1552              1.175                  31.425         2.89%                  2.94%
Pavel, I'm not seeing this behavior with brokers built from sources:
  origin/0.14-mrg-preview-2
  origin/0.18-mrg
  origin/0.22-mrg

Using your scripts, I'm seeing connection times that appear to be normal. Here's what I'm doing:

* on mrg30
  - build and make install the broker
  - run the broker with default settings

* on mrg25
  - compile your get_queue_details
  - modify 00893742_start_consumers_producers-loop.sh to point to mrg30
  - run ./00893742_start_consumers_producers-loop.sh 200
  - after the queues are created, run ./get_queue_details in a loop like:

while true; do echo "---------------------------------"; date; time ./get_queue_details PerfQueue1 mrg30.lab.bos.redhat.com | cut -d, -f35; done

get_queue_details is outputting blocks like:

---------------------------------
Fri Jun 6 11:10:53 EDT 2014
msgDepth:34000

real    0m1.528s
user    0m0.015s
sys     0m0.006s

I see the msgDepth climb and then gradually fall to 0. The time between the messages is about 1.5 seconds; it drops to about 1 second after all the messages are drained. This is consistent between the three versions of the broker that I built from source.

Am I misinterpreting the output of get_queue_details, or doing something else wrong?

I tried it with and without the broker store enabled.
(In reply to Ernie from comment #20)
> Pavel, I'm not seeing this behavior with brokers built from sources:
>   origin/0.14-mrg-preview-2
>   origin/0.18-mrg
>   origin/0.22-mrg
>
> Using your scripts, I'm seeing connection times that appear to be normal.
> Here's what I'm doing:
>
> * on mrg30
>   - build and make install the broker
>   - run the broker with default settings
>
> * on mrg25
>   - compile your get_queue_details
>   - modify 00893742_start_consumers_producers-loop.sh to point to mrg30
>   - run ./00893742_start_consumers_producers-loop.sh 200
>   - after the queues are created, run ./get_queue_details in a loop like:
>
> while true; do echo "---------------------------------"; date; time
> ./get_queue_details PerfQueue1 mrg30.lab.bos.redhat.com | cut -d, -f35; done
>
> I see the msgDepth climb and then gradually fall to 0. The time between
> the messages is about 1.5 seconds; it drops to about 1 second after all
> the messages are drained. This is consistent between the three versions
> of the broker that I built from source.
>
> Am I misinterpreting the output of get_queue_details, or doing something
> else wrong?
>
> I tried it with and without the broker store enabled.

How many producers do you use ("prods=12" in the script?)? How many worker threads does qpidd have? #producers should be more than #cores / #worker-threads, afaik.

(Or could you provide me credentials to both machines so I can set up the repro there, with a pointer to the 0.18 broker and how to start it?)
The broker was running on mrg30, which has 8 cores (2 CPUs with 4 cores each). The 00893742_start_consumers_producers-loop.sh script was running with the default 10 producers and 200 consumers.

I tried the test with --worker-threads 1 and saw the get_queue_details interval rise to about 2.5 seconds (from about 1.5).

The 0.18 broker is the version that is installed on mrg30. To run it, I just use qpidd --no-module-dir, since both the linearstore and legacystore are in the /usr/local/lib64/qpid/daemon/ dir.

I'll PM you the root credentials. The same credentials work on mrg25, where I ran the clients. However, I ran the clients under my personal account, /home/eallen; you can grab the client scripts from there.
Two problems fixed in the reproducer on mrg25 / mrg30:

1) qpid-send finished far too fast (within a fraction of a second), while we need the producers running while another client is trying to connect. I think the Python qpid-send / qpid-receive are not working properly there, as e.g. no "PerfQueue*" queue is created. I changed the reproducer script to:
- send 20k messages per producer instead of 5k (a faster machine than mine, so send more messages to get similar durations)
- use the C++ clients (/home/eallen/0.22/qpid/cpp/build/src/tests/qpid-*)

original script: ./00893742_start_consumers_producers-loop.sh.orig
new script: ./00893742_start_consumers_producers-loop.sh

2) get_queue_details, for an unknown reason, did not print queue statistics. Let's use:

time qpid-stat -e -b mrg30.lab.bos.redhat.com:5672

as the indicator client instead.
Created attachment 911803 [details]
Move duration calculation in AsyncIO.cpp
The attached patch changes where the duration is calculated in sys/posix/AsynchIO.cpp -> readable(). This prevents the loop from executing past threadMaxIoTimeNs. Doing this decreased the connection delay, as seen in Pavel's reproducer scripts, by 19%.

Workaround: increase worker-threads when starting the broker. The connection delay disappears when the number of worker threads is approx. 1.5 times the number of message producers.

A suggested future strategy (for a different BZ): instead of having all the publishing threads queueing up behind each other, queue up the routing requests and let one thread at a time handle a batch of them. You may still contend with any consumers, but it should alleviate the problem a fair bit. Not an entirely trivial fix, though, as the handling of acknowledgements, errors etc. needs to be done on a worker servicing the original connection, but certainly achievable.
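To make the patch's effect concrete, a hedged sketch of the before/after shape (a hypothetical simplification of readable(), not the actual diff; the callback names are invented for illustration): the elapsed time must be measured freshly on each loop iteration, rather than using a value computed once before the loop, so the loop cannot run past the budget.

#include <chrono>
#include <functional>

void readableLoop(const std::function<bool()>& moreBuffers,   // hypothetical callbacks
                  const std::function<void()>& processBuffer) {
    using clock = std::chrono::steady_clock;
    const auto limit = std::chrono::milliseconds(2);          // threadMaxIoTimeNs
    const auto start = clock::now();

    // Before the patch (in spirit): the duration check used a stale
    // value, so a backlog of buffers could keep the thread far past 2ms:
    //   auto elapsed = clock::now() - start;   // computed too early
    //   while (moreBuffers() && elapsed < limit) processBuffer();

    // After the patch (in spirit): recompute the elapsed time every pass,
    // so the thread returns to the poller once the budget is spent.
    while (moreBuffers() && (clock::now() - start) < limit)
        processBuffer();
}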
Patch committed upstream: ------------------------------------------------------------------------ r1605664 | aconway | 2014-06-26 00:34:12 -0400 (Thu, 26 Jun 2014) | 7 lines QPID-5843: Producing to many queues locks I/O threads for new connections This patch changes where the duration is calculated in sys/posix/AsyncIO.cpp->readable(). This will prevent the loop from executing past the threadMaxIoTimeNs. Doing this decreased the connection delay observed by 19%. ------------------------------------------------------------------------
Overall response time shows about a 20 percent improvement. This varies from machine to machine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0805.html