Bug 697966 - Consumption throughput decreases under increasing producer load
Summary: Consumption throughput decreases under increasing producer load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: 3.0
Target Release: ---
Assignee: Andrew Stitcher
QA Contact: Leonid Zhaldybin
URL:
Whiteboard:
Depends On:
Blocks: 698367
 
Reported: 2011-04-19 19:18 UTC by Jason Dillaman
Modified: 2014-11-09 22:38 UTC
CC: 6 users

Fixed In Version: qpid-0.16
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-24 15:02:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Federation Perf Test (961 bytes, text/plain)
2011-04-19 19:18 UTC, Jason Dillaman
AIO performance workaround patch (2.12 KB, patch)
2011-05-25 23:04 UTC, Jason Dillaman


Links
System ID                               Private  Priority  Status        Summary                                       Last Updated
Apache JIRA QPID-3608                   0        None      None          None                                          2012-11-14 19:52:38 UTC
Red Hat Product Errata RHEA-2014:1296   0        normal    SHIPPED_LIVE  Red Hat Enterprise MRG Messaging 3.0 Release  2014-09-24 19:00:06 UTC

Description Jason Dillaman 2011-04-19 19:18:11 UTC
Description of problem: 
Using the following broker setup:

  Broker1 Exchange->Broker1 Queue->federated link->Broker2 Exchange->Broker2 Queue

While messages are sent to Broker1's exchange, qpid-queue-stats is used to monitor the enqueue and dequeue rates of Broker1's queue. As additional senders publish to Broker1's exchange and the enqueue rate rises, the dequeue rate drops.

For example:
1 sender:  Enq: 42K msgs/sec, Deq: 40K msgs/sec
4 senders: Enq: 86K msgs/sec, Deq: 22K msgs/sec
8 senders: Enq: 92K msgs/sec, Deq: 14K msgs/sec

This only appears to occur when utilizing a federated link.  Replacing the link with a client consumer on Broker1's queue does not cause this issue.

Version-Release number of selected component (if applicable):
qpid-cpp-client-0.7.946106-28_ptc_hotfix_5_v2.el5
qpid-cpp-server-0.7.946106-28_ptc_hotfix_5_v2.el5

How reproducible:
100% as sender load is increased

Steps to Reproduce:
See attached script

Comment 1 Jason Dillaman 2011-04-19 19:18:45 UTC
Created attachment 493264 [details]
Federation Perf Test

Comment 2 Jason Dillaman 2011-04-21 17:26:44 UTC
It appears that the originally reported issue (federated link performance) is actually a more general problem. Even within a single broker, if there are more producers than consumers, the consumers appear to be starved even though the aggregate message production rate is about the same.

For example, using qpid-perftest:
10 producers, 10 consumers: Enq ~12K msgs/sec, Deq ~12K msgs/sec
100 producers, 10 consumers: Enq ~12K msgs/sec, Deq ~3K msgs/sec

Steps to Reproduce:

# Pubs == # Subs
qpid-perftest --mode topic --qt 10 --npub 1 --nsub 1 --sub-ack 1000 --size 300 --count 400000

More pubs than subs (--npub and --nsub are counted per queue, so with --qt 10 this runs 100 producers against 10 consumers):
qpid-perftest --mode topic --qt 10 --npub 10 --nsub 1 --sub-ack 1000 --size 300 --count 40000

Results:

# Pubs == # Subs

Queue Name                                     Sec       Depth     Enq Rate     Deq Rate
========================================================================================
anonymous.294120bf-f710-4e85-893f-ba278187ac57     10.00       1139     11934.54     11899.15
anonymous.6248feef-b442-4845-b47b-a076f1ebfcef     10.00        301     12001.15     12050.54
anonymous.6522f6f9-e152-40bc-b5f1-b6c180e08cc2     10.00        357     11942.06     11999.36
anonymous.85c181a6-d3ab-4457-8be5-89f54f8d5d3c     10.00        293     11984.57     11999.37
anonymous.8f8dd0df-ca86-427e-890e-d475fd114ace     10.00        800     11918.18     11899.38
anonymous.914be4c8-e334-487f-8fb5-0958c937609c     10.00       1228     11967.98     11899.39
anonymous.cde7e975-0626-459a-9cb2-a97fa9082a8a     10.00        416     12017.79     12077.38
anonymous.d395d062-39bd-4f5d-92d3-24e0d8aa8bfe     10.00        738     11917.59     11899.40
anonymous.e63378b8-5fb4-4284-ba0e-a96fb57d1b6a     10.00       1100     11992.69     11999.39
anonymous.fbb6f32c-fd79-48bb-8646-edfc9a86bb46     10.00        548     11835.21     11899.40

anonymous.294120bf-f710-4e85-893f-ba278187ac57     10.00        832     11867.29     11897.98
anonymous.6248feef-b442-4845-b47b-a076f1ebfcef     10.00       1178     11785.70     11698.02
anonymous.6522f6f9-e152-40bc-b5f1-b6c180e08cc2     10.00       1136     11975.87     11897.99
anonymous.85c181a6-d3ab-4457-8be5-89f54f8d5d3c     10.00       1149     11883.59     11798.00
anonymous.8f8dd0df-ca86-427e-890e-d475fd114ace     10.00        490     11866.99     11897.99
anonymous.914be4c8-e334-487f-8fb5-0958c937609c     10.00        246     11999.76     12097.95
anonymous.cde7e975-0626-459a-9cb2-a97fa9082a8a     10.00        760     11933.38     11898.98
anonymous.d395d062-39bd-4f5d-92d3-24e0d8aa8bfe     10.00        963     11920.48     11897.98
anonymous.e63378b8-5fb4-4284-ba0e-a96fb57d1b6a     10.00        458     11833.79     11897.98
anonymous.fbb6f32c-fd79-48bb-8646-edfc9a86bb46     10.00        404     11883.59     11897.98

More pubs than subs:

Queue Name                                     Sec       Depth     Enq Rate     Deq Rate
========================================================================================
anonymous.0a265f08-0986-4409-98c3-9b96333349f5     10.00     194839     16851.79      2900.51
anonymous.25ab9cc6-49d3-4c3e-9559-6e268b8d84a7     10.00     194029     16918.40      2900.51
anonymous.341ac969-3743-4e9e-bcb6-141feb057dc4     10.00     195108     17034.63      3000.53
anonymous.49821c81-0ac2-45e4-a875-47c8b4e9e7ca     10.00     196373     16902.90      2900.52
anonymous.6b377262-173a-46a7-87ce-cd98ff6c39ff     10.00     193455     16901.81      3000.53
anonymous.8d6762cd-df53-4d8b-a21e-c1b1a68cfc4f     10.00     190440     16802.20      2900.52
anonymous.97ced592-2182-4d2f-8de9-30d2707513e7     10.00     196771     16998.64      3000.54
anonymous.a8140b94-ffda-47b5-a9f3-f0a45a32b5d5     10.00     194564     16885.22      3000.54
anonymous.b1e691fc-f757-4dc5-9691-daa67ee81358     10.00     196647     16917.73      2900.52
anonymous.b4196dfe-a319-4534-9980-278e133e12cc     10.00     195056     16881.33      2900.52

anonymous.0a265f08-0986-4409-98c3-9b96333349f5     10.00     324427     15755.66      2799.44
anonymous.25ab9cc6-49d3-4c3e-9559-6e268b8d84a7     10.00     324275     15821.44      2799.44
anonymous.341ac969-3743-4e9e-bcb6-141feb057dc4     10.00     325732     15759.26      2699.46
anonymous.49821c81-0ac2-45e4-a875-47c8b4e9e7ca     10.00     325159     15675.48      2799.44
anonymous.6b377262-173a-46a7-87ce-cd98ff6c39ff     10.00     323491     15700.48      2699.46
anonymous.8d6762cd-df53-4d8b-a21e-c1b1a68cfc4f     10.00     320306     15783.46      2799.44
anonymous.97ced592-2182-4d2f-8de9-30d2707513e7     10.00     325937     15613.50      2699.46
anonymous.a8140b94-ffda-47b5-a9f3-f0a45a32b5d5     10.00     325447     15785.16      2699.46
anonymous.b1e691fc-f757-4dc5-9691-daa67ee81358     10.00     326347     15766.87      2799.44
anonymous.b4196dfe-a319-4534-9980-278e133e12cc     10.00     325096     15800.86      2799.44

anonymous.0a265f08-0986-4409-98c3-9b96333349f5     10.00     128000       457.08     20090.42
anonymous.25ab9cc6-49d3-4c3e-9559-6e268b8d84a7     10.00     131877       872.08     20102.72
anonymous.341ac969-3743-4e9e-bcb6-141feb057dc4     10.00     133000       726.45     19990.47
anonymous.49821c81-0ac2-45e4-a875-47c8b4e9e7ca     10.00     133000       783.73     19990.47
anonymous.6b377262-173a-46a7-87ce-cd98ff6c39ff     10.00     131000       950.45     20190.38
anonymous.8d6762cd-df53-4d8b-a21e-c1b1a68cfc4f     10.00     131000      1268.80     20190.37
anonymous.97ced592-2182-4d2f-8de9-30d2707513e7     10.00     131000       606.01     20090.42
anonymous.a8140b94-ffda-47b5-a9f3-f0a45a32b5d5     10.00     132000       754.94     20090.42
anonymous.b1e691fc-f757-4dc5-9691-daa67ee81358     10.00     133744       664.98     19916.11
anonymous.b4196dfe-a319-4534-9980-278e133e12cc     10.00     134000       790.02     19890.52

Comment 3 Jason Dillaman 2011-04-21 20:28:40 UTC
This issue might be related to code in the POSIX AsynchIO::writeable method that attempts to prevent writes from dominating reads. Commenting out this section of code appears to resolve the issue (but might introduce regressions elsewhere).

--- qpid-cpp-mrg-0.7.946106-orig/cpp/src/qpid/sys/posix/AsynchIO.cpp 2011-04-21 13:20:51.000000000 -0700
+++ qpid-cpp-mrg-0.7.946106-patch/cpp/src/qpid/sys/posix/AsynchIO.cpp   2011-04-21 13:10:21.000000000 -0700
@@ -513,8 +513,8 @@ void AsynchIO::writeable(DispatchHandle&
                 
                 // If we've already written more than the max for reading then stop
                 // (this is to stop writes dominating reads) 
-                if (writeTotal > threadMaxRead)
-                    break;
+//                if (writeTotal > threadMaxRead)
+//                    break;
             } else {
                 // Put buffer back
                 writeQueue.push_back(buff);
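
For context, a minimal, self-contained sketch of the write-loop shape this guard lives in (hypothetical: the WriterSketch class, sockWrite, and noteRead are invented for illustration, and the real AsynchIO code differs in structure and error handling). It shows how a small threadMaxRead starves writes when the reads on a thread are small:

#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

struct Buffer { std::vector<char> data; };

class WriterSketch {
    std::deque<Buffer> writeQueue;
    std::size_t threadMaxRead = 0;  // largest single read seen on this io thread

    // Stand-in for a non-blocking socket write; returns bytes written.
    std::size_t sockWrite(const Buffer& b) { return b.data.size(); }

public:
    void queued(Buffer b) { writeQueue.push_back(std::move(b)); }
    void noteRead(std::size_t bytes) { if (bytes > threadMaxRead) threadMaxRead = bytes; }

    // Analogue of AsynchIO::writeable: drain the queue, but stop once this
    // pass has written more bytes than the thread's largest observed read --
    // the guard commented out by the patch above.
    void writeable() {
        std::size_t writeTotal = 0;
        while (!writeQueue.empty()) {
            writeTotal += sockWrite(writeQueue.front());
            writeQueue.pop_front();
            if (writeTotal > threadMaxRead)
                break;
        }
    }
};

int main() {
    WriterSketch w;
    w.noteRead(300);  // reads so far have been small, e.g. 300-byte messages
    for (int i = 0; i < 100; ++i)
        w.queued(Buffer{std::vector<char>(300)});
    w.writeable();    // exits after only 2 buffers; 98 stay queued, so the
                      // consuming side of this connection is starved
}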

Comment 4 Andrew Stitcher 2011-04-21 22:04:04 UTC
It is probably safe to remove this test.

As the comment notes, it was included to stop writes to a single socket from completely dominating the activity of an io thread.

This was observed to happen (before the test was included) in a case where enough messages were going to a single consumer that they were being received as fast as, or faster than, they could be written out to that consumer. In that case the write loop without this test would never exit, and the thread would never be available for any other io operations.

The specific test is entirely heuristic. It relies on the general observation that, in many (most?) situations, the bytes read over the entire broker will equal the bytes written over the entire broker, and that each io thread will have broadly similar read/write statistics (the stats are kept on a per-thread basis).

This situation may well no longer apply at all in the current broker code.

In any event, balancing reads against writes is something we should look at carefully in the broker, and we should try to improve the heuristics used.

Comment 5 Andrew Stitcher 2011-04-21 22:17:27 UTC
Two other possible fixes:

1. Initialise threadMaxRead to a more reasonable write maximum, say a few KB (maybe even as high as 64KB), rather than 0, so that if all the reads are really small the writes will at least not be limited by that; if the reads turn out to be really large, the maximum would still go up.

2. Use a time-based maximum, as in the read logic (currently 2 ms, which actually corresponds to a sizeable amount of data: 2 ms @ 10 Gbit/s = 20 Mbit, i.e. about 2.5 MB).
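
Both options could look roughly like this (a hypothetical sketch with invented names, not the actual upstream fix; the two checks are alternatives but are shown together here):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <deque>
#include <vector>

struct Buffer { std::vector<char> data; };

constexpr std::size_t kMinWriteBytes = 64 * 1024;        // option 1: 64k floor
constexpr std::chrono::milliseconds kMaxWriteTime{2};    // option 2: 2 ms budget

std::size_t sockWrite(const Buffer& b) { return b.data.size(); }  // stand-in

void writeable(std::deque<Buffer>& writeQueue, std::size_t threadMaxRead) {
    const auto start = std::chrono::steady_clock::now();

    // Option 1: the byte budget never drops below 64k even if every read so
    // far was tiny, but still grows once large reads are observed.
    const std::size_t budget = std::max(threadMaxRead, kMinWriteBytes);

    std::size_t writeTotal = 0;
    while (!writeQueue.empty()) {
        writeTotal += sockWrite(writeQueue.front());
        writeQueue.pop_front();

        if (writeTotal > budget)                                       // option 1
            break;
        if (std::chrono::steady_clock::now() - start > kMaxWriteTime)  // option 2
            break;
    }
}

Either way the io thread periodically returns to the event loop instead of being throttled to near zero by small reads.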

Comment 6 Jason Dillaman 2011-05-25 23:04:36 UTC
Created attachment 500954 [details]
AIO performance workaround patch

This is the patch that we have been using for PTC to avoid the performance penalty with the current AIO writer logic.

Diff against git commit bb0aabf7e85aa4dd46de3782737bb6327693db68.

Comment 7 Andrew Stitcher 2011-05-26 12:50:09 UTC
(In reply to comment #6)
> Created attachment 500954 [details]
> AIO performance workaround patch
> 
> This is the patch that we have been using for PTC to avoid the performance
> penalty with the current AIO writer logic.
> 
> Diff against git commit bb0aabf7e85aa4dd46de3782737bb6327693db68.

I'd be very happy to push this change upstream. Could you give some detail about the performance change you've seen with the patch?

Comment 8 Jason Dillaman 2011-05-26 16:23:29 UTC
With the patch applied, we see much better dequeue performance in the broker when there are more producers than consumers (a deq rate of 12K msgs/sec, compared to < 1K msgs/sec before). We haven't seen any adverse effects.

# Pubs == # Subs:

Queue Name                                     Sec       Depth     Enq Rate     Deq Rate
========================================================================================
anonymous.0047ab96-3386-4e51-8614-90a297a0f8a2     10.00        662     19121.98     19098.99
anonymous.012a3741-888e-4f54-bf9b-34a3cf697f11     10.00        806     21907.45     21998.84
anonymous.306db59a-66ae-472e-9b86-3ce82a34f08b     10.00        739     19794.56     19798.96
anonymous.59c1a746-802f-4f43-93c0-5dbc5da93224     10.00        616     20075.75     20098.95
anonymous.97ada8d2-ac1f-4708-b081-755cc6924a0e     10.00       1086     20566.62     20598.92
anonymous.ad9ef5ab-5a8b-4f99-80d2-ffff06581c86     10.00        532     19337.69     19298.99
anonymous.b9b3e7c9-2799-4406-b7ea-2f216d8d88d5     10.00        534     19392.39     19398.99
anonymous.bc0d6890-24be-466c-8f57-9a538010c8b0     10.00        601     18860.32     18899.02
anonymous.c4ab67da-57a3-459a-9410-3546813ea909     10.00        933     18466.55     18444.75

More pubs than subs:

Queue Name                                     Sec       Depth     Enq Rate     Deq Rate
========================================================================================
anonymous.1813e0b7-2cf1-47d4-b7d2-b766a1556c8a     10.00     272000     26330.81     11196.77
anonymous.1b8adf71-0926-4c72-bdac-956ec7f2bbec     10.00     265000     24803.78     11896.58
anonymous.1b92c721-c4da-46e8-bea4-73d3d32598ed     10.00     261000     24365.73     12396.45
anonymous.47b674be-36c0-4486-95e4-218016b021fa     10.00     263000     25465.13     12096.55
anonymous.83dd1b0d-5a77-4dfa-bd35-0cd93c7c270d     10.00     261000     23040.85     12396.48
anonymous.afe5c6a7-57bd-4237-92f8-cc778b5a0d3f     10.00     260000     22297.80     12296.52
anonymous.b50e6677-3dc6-46b7-b60f-395626654004     10.00     268000     25887.02     11796.68
anonymous.bbd85791-4423-4a05-b6ad-078df135e744     10.00     266594     25611.13     11937.36
anonymous.be7897ec-7909-4867-8bb5-64d99b080f94     10.00     206849     21266.67     10943.25
anonymous.cba5479c-936c-4a13-8913-99a42417e646     10.00     273000     26614.20     11196.89

Comment 9 Andrew Stitcher 2011-11-11 16:51:34 UTC
A fix for this issue has now been applied upstream (to the 0.15 development stream ready for the 0.16 release). If required this could be backported to a MRG release stream.

r1200925 is the Apache svn revision.

Comment 10 Andrew Stitcher 2011-11-11 16:55:03 UTC
Upstream bug:

https://issues.apache.org/jira/browse/QPID-3608

Comment 11 Ted Ross 2012-03-29 15:54:52 UTC
Flagged for 2.2.0 as this will be picked up in the re-base.

Comment 17 Leonid Zhaldybin 2014-01-21 14:21:07 UTC
Tested on RHEL6.5 (both i386 and x86_64). Performance in the case where there are more publishers than subscribers is much better in the 3.0 (Vienna) builds.

Packages used for testing:

python-qpid-0.22-9.el6
python-qpid-qmf-0.22-25.el6
qpid-cpp-client-0.22-31.el6
qpid-cpp-client-devel-0.22-31.el6
qpid-cpp-client-devel-docs-0.22-31.el6
qpid-cpp-client-rdma-0.22-31.el6
qpid-cpp-client-ssl-0.22-31.el6
qpid-cpp-server-0.22-31.el6
qpid-cpp-server-devel-0.22-31.el6
qpid-cpp-server-ha-0.22-31.el6
qpid-cpp-server-linearstore-0.22-31.el6
qpid-cpp-server-rdma-0.22-31.el6
qpid-cpp-server-ssl-0.22-31.el6
qpid-cpp-server-xml-0.22-31.el6
qpid-proton-c-0.6-1.el6
qpid-qmf-0.22-25.el6
qpid-tools-0.22-7.el6

-> VERIFIED

Comment 19 errata-xmlrpc 2014-09-24 15:02:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1296.html

