Bug 697966
Summary: | Consumption throughput decreases under increasing producer load | |
---|---|---|---
Product: | Red Hat Enterprise MRG | Reporter: | Jason Dillaman <jdillama>
Component: | qpid-cpp | Assignee: | Andrew Stitcher <astitcher>
Status: | CLOSED ERRATA | QA Contact: | Leonid Zhaldybin <lzhaldyb>
Severity: | unspecified | Priority: | medium
Version: | 1.3 | CC: | agoldste, astitcher, esammons, jross, lzhaldyb, tross
Target Milestone: | 3.0 | Keywords: | Patch
Target Release: | --- | Hardware: | Unspecified
OS: | Unspecified | Fixed In Version: | qpid-0.16
Doc Type: | Bug Fix | Last Closed: | 2014-09-24 15:02:32 UTC
Bug Blocks: | 698367 | |
Description
Jason Dillaman
2011-04-19 19:18:11 UTC
Created attachment 493264 [details]
Federation Perf Test
It appears that the originally reported issue (federated link performance) is actually a more general problem. Even within a single broker, if there are more producers than consumers, the consumers appear to be starved even though the aggregate message production rate is about the same. For example, using qpid-perftest:

10 producers, 10 consumers: Enq ~12K msgs/sec, Deq ~12K msgs/sec
100 producers, 10 consumers: Enq ~12K msgs/sec, Deq ~3K msgs/sec

Steps to Reproduce:

# Pubs == # Subs:
qpid-perftest --mode topic --qt 10 --npub 1 --nsub 1 --sub-ack 1000 --size 300 --count 400000

More pubs than subs:
qpid-perftest --mode topic --qt 10 --npub 10 --nsub 1 --sub-ack 1000 --size 300 --count 40000

Results:

# Pubs == # Subs:

Queue Name Sec Depth Enq Rate Deq Rate
========================================================================================
anonymous.294120bf-f710-4e85-893f-ba278187ac57 10.00 1139 11934.54 11899.15
anonymous.6248feef-b442-4845-b47b-a076f1ebfcef 10.00 301 12001.15 12050.54
anonymous.6522f6f9-e152-40bc-b5f1-b6c180e08cc2 10.00 357 11942.06 11999.36
anonymous.85c181a6-d3ab-4457-8be5-89f54f8d5d3c 10.00 293 11984.57 11999.37
anonymous.8f8dd0df-ca86-427e-890e-d475fd114ace 10.00 800 11918.18 11899.38
anonymous.914be4c8-e334-487f-8fb5-0958c937609c 10.00 1228 11967.98 11899.39
anonymous.cde7e975-0626-459a-9cb2-a97fa9082a8a 10.00 416 12017.79 12077.38
anonymous.d395d062-39bd-4f5d-92d3-24e0d8aa8bfe 10.00 738 11917.59 11899.40
anonymous.e63378b8-5fb4-4284-ba0e-a96fb57d1b6a 10.00 1100 11992.69 11999.39
anonymous.fbb6f32c-fd79-48bb-8646-edfc9a86bb46 10.00 548 11835.21 11899.40
anonymous.294120bf-f710-4e85-893f-ba278187ac57 10.00 832 11867.29 11897.98
anonymous.6248feef-b442-4845-b47b-a076f1ebfcef 10.00 1178 11785.70 11698.02
anonymous.6522f6f9-e152-40bc-b5f1-b6c180e08cc2 10.00 1136 11975.87 11897.99
anonymous.85c181a6-d3ab-4457-8be5-89f54f8d5d3c 10.00 1149 11883.59 11798.00
anonymous.8f8dd0df-ca86-427e-890e-d475fd114ace 10.00 490 11866.99 11897.99
anonymous.914be4c8-e334-487f-8fb5-0958c937609c 10.00 246 11999.76 12097.95
anonymous.cde7e975-0626-459a-9cb2-a97fa9082a8a 10.00 760 11933.38 11898.98
anonymous.d395d062-39bd-4f5d-92d3-24e0d8aa8bfe 10.00 963 11920.48 11897.98
anonymous.e63378b8-5fb4-4284-ba0e-a96fb57d1b6a 10.00 458 11833.79 11897.98
anonymous.fbb6f32c-fd79-48bb-8646-edfc9a86bb46 10.00 404 11883.59 11897.98

More pubs than subs:

Queue Name Sec Depth Enq Rate Deq Rate
========================================================================================
anonymous.0a265f08-0986-4409-98c3-9b96333349f5 10.00 194839 16851.79 2900.51
anonymous.25ab9cc6-49d3-4c3e-9559-6e268b8d84a7 10.00 194029 16918.40 2900.51
anonymous.341ac969-3743-4e9e-bcb6-141feb057dc4 10.00 195108 17034.63 3000.53
anonymous.49821c81-0ac2-45e4-a875-47c8b4e9e7ca 10.00 196373 16902.90 2900.52
anonymous.6b377262-173a-46a7-87ce-cd98ff6c39ff 10.00 193455 16901.81 3000.53
anonymous.8d6762cd-df53-4d8b-a21e-c1b1a68cfc4f 10.00 190440 16802.20 2900.52
anonymous.97ced592-2182-4d2f-8de9-30d2707513e7 10.00 196771 16998.64 3000.54
anonymous.a8140b94-ffda-47b5-a9f3-f0a45a32b5d5 10.00 194564 16885.22 3000.54
anonymous.b1e691fc-f757-4dc5-9691-daa67ee81358 10.00 196647 16917.73 2900.52
anonymous.b4196dfe-a319-4534-9980-278e133e12cc 10.00 195056 16881.33 2900.52
anonymous.0a265f08-0986-4409-98c3-9b96333349f5 10.00 324427 15755.66 2799.44
anonymous.25ab9cc6-49d3-4c3e-9559-6e268b8d84a7 10.00 324275 15821.44 2799.44
anonymous.341ac969-3743-4e9e-bcb6-141feb057dc4 10.00 325732 15759.26 2699.46
anonymous.49821c81-0ac2-45e4-a875-47c8b4e9e7ca 10.00 325159 15675.48 2799.44
anonymous.6b377262-173a-46a7-87ce-cd98ff6c39ff 10.00 323491 15700.48 2699.46
anonymous.8d6762cd-df53-4d8b-a21e-c1b1a68cfc4f 10.00 320306 15783.46 2799.44
anonymous.97ced592-2182-4d2f-8de9-30d2707513e7 10.00 325937 15613.50 2699.46
anonymous.a8140b94-ffda-47b5-a9f3-f0a45a32b5d5 10.00 325447 15785.16 2699.46
anonymous.b1e691fc-f757-4dc5-9691-daa67ee81358 10.00 326347 15766.87 2799.44
anonymous.b4196dfe-a319-4534-9980-278e133e12cc 10.00 325096 15800.86 2799.44
anonymous.0a265f08-0986-4409-98c3-9b96333349f5 10.00 128000 457.08 20090.42
anonymous.25ab9cc6-49d3-4c3e-9559-6e268b8d84a7 10.00 131877 872.08 20102.72
anonymous.341ac969-3743-4e9e-bcb6-141feb057dc4 10.00 133000 726.45 19990.47
anonymous.49821c81-0ac2-45e4-a875-47c8b4e9e7ca 10.00 133000 783.73 19990.47
anonymous.6b377262-173a-46a7-87ce-cd98ff6c39ff 10.00 131000 950.45 20190.38
anonymous.8d6762cd-df53-4d8b-a21e-c1b1a68cfc4f 10.00 131000 1268.80 20190.37
anonymous.97ced592-2182-4d2f-8de9-30d2707513e7 10.00 131000 606.01 20090.42
anonymous.a8140b94-ffda-47b5-a9f3-f0a45a32b5d5 10.00 132000 754.94 20090.42
anonymous.b1e691fc-f757-4dc5-9691-daa67ee81358 10.00 133744 664.98 19916.11
anonymous.b4196dfe-a319-4534-9980-278e133e12cc 10.00 134000 790.02 19890.52

This issue might be related to code in the POSIX AsynchIO::writeable method that attempts to prevent the writes from dominating the reads. Commenting out this section of code appears to solve the issue (but might introduce regressions elsewhere).

--- qpid-cpp-mrg-0.7.946106-orig/cpp/src/qpid/sys/posix/AsynchIO.cpp	2011-04-21 13:20:51.000000000 -0700
+++ qpid-cpp-mrg-0.7.946106-patch/cpp/src/qpid/sys/posix/AsynchIO.cpp	2011-04-21 13:10:21.000000000 -0700
@@ -513,8 +513,8 @@ void AsynchIO::writeable(DispatchHandle&
             // If we've already written more than the max for reading then stop
             // (this is to stop writes dominating reads)
-            if (writeTotal > threadMaxRead)
-                break;
+//            if (writeTotal > threadMaxRead)
+//                break;
         } else {
             // Put buffer back
             writeQueue.push_back(buff);

It is probably safe to remove this test. As the comment notes, it was included to stop writing to a single socket from completely dominating the activity of an I/O thread. This was observed to happen (before the test was included) when enough messages were destined for a single consumer that they arrived as fast as, or faster than, they could be written out to that consumer; in that case the write loop without this test would never exit, and the thread would never become available for any other I/O operations.

The specific test is entirely heuristic. It relies on the general observations that the bytes read across the entire broker will equal the bytes written across the entire broker in many (most?) situations, and that each I/O thread will have broadly similar read/write statistics (the stats are kept on a per-thread basis).

This situation may well no longer apply at all in the current broker code. In any event, balancing reads against writes is something we should look at carefully in the broker, and we should try to improve the heuristics used.

Two other possible fixes (a rough sketch combining both follows the workaround patch notes below):

1. Initialise threadMaxRead to a more reasonable write maximum, say a few KB (maybe even as high as 64 KB), rather than 0, so that if all the reads are really small the writes will at least not be limited by that; if the reads turn out to be really large, the maximum would still go up.

2. Use a time-based maximum as in the read logic (currently 2 ms, which actually corresponds to a very large amount of data: 2 ms at 10 Gb/s is 20 Mb).

Created attachment 500954 [details]
AIO performance workaround patch
This is the patch that we have been using for PTC to avoid the performance penalty with the current AIO writer logic.
Diff against git commit bb0aabf7e85aa4dd46de3782737bb6327693db68.
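As an illustration of the two alternative fixes suggested above, here is a minimal, self-contained sketch of a write loop that combines both ideas: a byte budget that never drops below a fixed floor, and a wall-clock budget mirroring the 2 ms budget used on the read side. This is not the qpid-cpp AsynchIO code; the names WriteBuffer, writeToSocket, drainWrites, kMinWriteBudgetBytes and kWriteTimeBudget are hypothetical, and the constants are only plausible starting points.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a queued output buffer.
struct WriteBuffer {
    std::vector<char> data;
};

// Hypothetical stand-in for writing one buffer to a non-blocking socket;
// returns the number of bytes written.
static std::size_t writeToSocket(const WriteBuffer& buf) {
    return buf.data.size();
}

// Drain the write queue subject to two limits:
//  * a byte budget that is never smaller than a fixed floor (fix option 1), and
//  * a wall-clock budget (fix option 2),
// so that one connection's writes cannot pin the I/O thread indefinitely.
std::size_t drainWrites(std::deque<WriteBuffer>& writeQueue, std::size_t threadMaxRead) {
    constexpr std::size_t kMinWriteBudgetBytes = 64 * 1024;          // floor instead of 0
    constexpr auto kWriteTimeBudget = std::chrono::milliseconds(2);  // like the read loop

    const std::size_t byteBudget = std::max(threadMaxRead, kMinWriteBudgetBytes);
    const auto deadline = std::chrono::steady_clock::now() + kWriteTimeBudget;

    std::size_t written = 0;
    while (!writeQueue.empty()) {
        written += writeToSocket(writeQueue.front());
        writeQueue.pop_front();

        // Stop once either budget is exhausted so the thread can go service reads.
        if (written > byteBudget || std::chrono::steady_clock::now() >= deadline)
            break;
    }
    return written;
}

int main() {
    // 100 queued buffers of 300 bytes, roughly matching the perftest message size.
    std::deque<WriteBuffer> queue(100, WriteBuffer{std::vector<char>(300)});
    std::cout << "bytes written this pass: " << drainWrites(queue, 0) << '\n';
}
```

Either limit on its own would stop a single busy connection from monopolizing an I/O thread; the byte floor additionally covers the case the original heuristic suffers from, where threadMaxRead stays near zero because the reads seen so far are all tiny.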
(In reply to comment #6)
> Created attachment 500954 [details]
> AIO performance workaround patch
>
> This is the patch that we have been using for PTC to avoid the performance
> penalty with the current AIO writer logic.
>
> Diff against git commit bb0aabf7e85aa4dd46de3782737bb6327693db68.

I'd be very happy to push this change upstream. Could you give some detail about the performance change you've seen with the patch?

With the patch applied, we see much better dequeue performance in the broker when there are more producers than consumers (a dequeue rate of 12K msgs/sec compared to < 1K msgs/sec). We haven't seen any adverse effects.

# Pubs == # Subs:

Queue Name Sec Depth Enq Rate Deq Rate
========================================================================================
anonymous.0047ab96-3386-4e51-8614-90a297a0f8a2 10.00 662 19121.98 19098.99
anonymous.012a3741-888e-4f54-bf9b-34a3cf697f11 10.00 806 21907.45 21998.84
anonymous.306db59a-66ae-472e-9b86-3ce82a34f08b 10.00 739 19794.56 19798.96
anonymous.59c1a746-802f-4f43-93c0-5dbc5da93224 10.00 616 20075.75 20098.95
anonymous.97ada8d2-ac1f-4708-b081-755cc6924a0e 10.00 1086 20566.62 20598.92
anonymous.ad9ef5ab-5a8b-4f99-80d2-ffff06581c86 10.00 532 19337.69 19298.99
anonymous.b9b3e7c9-2799-4406-b7ea-2f216d8d88d5 10.00 534 19392.39 19398.99
anonymous.bc0d6890-24be-466c-8f57-9a538010c8b0 10.00 601 18860.32 18899.02
anonymous.c4ab67da-57a3-459a-9410-3546813ea909 10.00 933 18466.55 18444.75

More pubs than subs:

Queue Name Sec Depth Enq Rate Deq Rate
========================================================================================
anonymous.1813e0b7-2cf1-47d4-b7d2-b766a1556c8a 10.00 272000 26330.81 11196.77
anonymous.1b8adf71-0926-4c72-bdac-956ec7f2bbec 10.00 265000 24803.78 11896.58
anonymous.1b92c721-c4da-46e8-bea4-73d3d32598ed 10.00 261000 24365.73 12396.45
anonymous.47b674be-36c0-4486-95e4-218016b021fa 10.00 263000 25465.13 12096.55
anonymous.83dd1b0d-5a77-4dfa-bd35-0cd93c7c270d 10.00 261000 23040.85 12396.48
anonymous.afe5c6a7-57bd-4237-92f8-cc778b5a0d3f 10.00 260000 22297.80 12296.52
anonymous.b50e6677-3dc6-46b7-b60f-395626654004 10.00 268000 25887.02 11796.68
anonymous.bbd85791-4423-4a05-b6ad-078df135e744 10.00 266594 25611.13 11937.36
anonymous.be7897ec-7909-4867-8bb5-64d99b080f94 10.00 206849 21266.67 10943.25
anonymous.cba5479c-936c-4a13-8913-99a42417e646 10.00 273000 26614.20 11196.89

A fix for this issue has now been applied upstream (to the 0.15 development stream, ready for the 0.16 release). If required, this could be backported to an MRG release stream. r1200925 is the Apache svn revision.

Upstream bug: https://issues.apache.org/jira/browse/QPID-3608

Flagged for 2.2.0, as this will be picked up in the re-base.

Tested on RHEL 6.5 (both i386 and x86_64). The performance when there are more publishers than subscribers is much better in the 3.0 (Vienna) builds.
Packages used for testing:

python-qpid-0.22-9.el6
python-qpid-qmf-0.22-25.el6
qpid-cpp-client-0.22-31.el6
qpid-cpp-client-devel-0.22-31.el6
qpid-cpp-client-devel-docs-0.22-31.el6
qpid-cpp-client-rdma-0.22-31.el6
qpid-cpp-client-ssl-0.22-31.el6
qpid-cpp-server-0.22-31.el6
qpid-cpp-server-devel-0.22-31.el6
qpid-cpp-server-ha-0.22-31.el6
qpid-cpp-server-linearstore-0.22-31.el6
qpid-cpp-server-rdma-0.22-31.el6
qpid-cpp-server-ssl-0.22-31.el6
qpid-cpp-server-xml-0.22-31.el6
qpid-proton-c-0.6-1.el6
qpid-qmf-0.22-25.el6
qpid-tools-0.22-7.el6

-> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1296.html