Bug 598550
Description
Ken Giusti
2010-06-01 15:37:52 UTC
Created attachment 419017 [details]
Raw callgrind output from a run with QMF Variant-based messaging enabled.
Created attachment 419018 [details]
Raw callgrind output from a run with legacy QMF fieldtable-based messaging enabled.
Created attachment 419022 [details]
callgrind_annotate --threshold=80 ~/callgrind.out.19094.V2OFF > ~/V2OFF_topcalls.txt
Post-processed callgrind trace showing the most-often-called functions for legacy QMF fieldtable-based messages.
Created attachment 419027 [details]
callgrind_annotate --threshold=80 ~/callgrind.out.19094.V2ON > ~/V2ON_topcalls.txt
Post-processed callgrind trace showing the most-often-called functions for a QMF getObjects request using the new Variant-style message format.
I've wrapped the handleGetQueryLH() routine in ManagementAgent.cpp with clock_gettime() calls to determine the amount of time the broker spends building and sending replies to a query for 30K queues. I ran the query with the QMF V2 Variant-style message format turned on and off:

30K queues, V2 Variant messages:
2010-06-02 09:21:09 error SEC=6 NSEC=317236582

30K queues, legacy fieldtable messages:
2010-06-02 09:24:55 error SEC=0 NSEC=333698826

The same data set and search algorithms are used in both cases - the only difference between the two runs is the manner in which the reply messages are built.

A diff of the timing code:

[kgiusti@localhost 581006]$ svn diff ~/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp
Index: /home/kgiusti/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp
===================================================================
--- /home/kgiusti/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp (revision 950130)
+++ /home/kgiusti/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp (working copy)
@@ -1480,6 +1492,9 @@
     FieldTable ft;
     FieldTable::ValuePtr value;

+    struct timespec _s, _e;
+    int v1 = clock_gettime(CLOCK_MONOTONIC, &_s);
+
     moveNewObjectsLH();

     ft.decode(inBuffer);
@@ -1550,11 +1565,28 @@
     }

     sendCommandCompleteLH(replyToKey, sequence);
+
+
+    int v2 = clock_gettime(CLOCK_MONOTONIC, &_e);
+    if (!v1 && !v2) {
+        if (_s.tv_sec == _e.tv_sec) {
+            QPID_LOG(error, "SEC=0 NSEC=" << (_e.tv_nsec - _s.tv_nsec));
+        } else if (_s.tv_sec < _e.tv_sec) {
+            QPID_LOG(error, "SEC=" << (_e.tv_sec - _s.tv_sec) << " NSEC=" << (_e.tv_nsec + 1000000000) - _s.tv_nsec);
+        } else {
+            QPID_LOG(error, "WRAP SEC=" << _s.tv_sec << " SEC=" << _e.tv_sec);
+        }
+    }
+
 }

 void ManagementAgent::handleGetQueryLH(const string& body, const string& replyTo, const string& cid, bool viaLocal)
 {
+    struct timespec _s, _e;
+    int v1 = clock_gettime(CLOCK_MONOTONIC, &_s);
+
+
     moveNewObjectsLH();

     Variant::Map inMap;
@@ -1690,6 +1722,20 @@
     ListCodec::encode(list_, content);
     sendBufferLH(content, cid, headers, "amqp/list", v2Direct, replyTo);
     QPID_LOG(trace, "SENT QueryResponse (empty with no 'partial' indicator) to=" << replyTo);
+
+
+    int v2 = clock_gettime(CLOCK_MONOTONIC, &_e);
+    if (!v1 && !v2) {
+        if (_s.tv_sec == _e.tv_sec) {
+            QPID_LOG(error, "SEC=0 NSEC=" << (_e.tv_nsec - _s.tv_nsec));
+        } else if (_s.tv_sec < _e.tv_sec) {
+            QPID_LOG(error, "SEC=" << (_e.tv_sec - _s.tv_sec) << " NSEC=" << (_e.tv_nsec + 1000000000) - _s.tv_nsec);
+        } else {
+            QPID_LOG(error, "WRAP SEC=" << _s.tv_sec << " SEC=" << _e.tv_sec);
+        }
+    }
+
+
 }

Additional timing data. For ~30000 variant-based queries, the total time spent across this code:

v1 = clock_gettime(CLOCK_MONOTONIC, &_s);

object->writeTimestamps(map_);
object->mapEncodeValues(values, true, true);   // write both stats and properties
iter->first.mapEncode(oidMap);

map_["_values"] = values;
map_["_object_id"] = oidMap;
map_["_schema_id"] = mapEncodeSchemaId(object->getPackageName(),
                                       object->getClassName(),
                                       "_data",
                                       object->getMd5Sum());
list_.push_back(map_);
ListCodec::encode(list_, content);

v2 = clock_gettime(CLOCK_MONOTONIC, &_e);

comes to 8.1 - 8.3 seconds across several runs of the same data. Timing just the ListCodec::encode() call results in 2.6 - 3.7 seconds total.

Created attachment 419159 [details]
Python script to configure a large number of queues and bindings.
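For readers without the attachment handy, here is a minimal sketch of how such a setup script might create a large batch of queues using the Python qpid.messaging API. This is a hypothetical illustration, not the attached qscale_test.py; the queue-name pattern, option names, and omission of bindings/teardown are assumptions.

#!/usr/bin/env python
# Hypothetical sketch: declare N queues on a broker so that QMF query
# performance can be measured. The real qscale_test.py attachment also
# creates bindings and tears the configuration down afterwards.
from optparse import OptionParser
from qpid.messaging import Connection

parser = OptionParser()
parser.add_option("--broker", default="127.0.0.1")
parser.add_option("--count", type="int", default=30000)
opts, args = parser.parse_args()

conn = Connection(opts.broker)
conn.open()
ssn = conn.session()
try:
    for i in range(opts.count):
        # "create: always" asks the broker to declare the queue if it
        # does not already exist.
        snd = ssn.sender("scale-test-queue-%06d; {create: always}" % i)
        snd.close()
    # Keep the setup alive while queries are run against the broker.
    raw_input("Press <ENTER> to continue...")
finally:
    conn.close()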
How to run an example that generates a large amount of QMF traffic. This can be used with a profiler to gauge the performance of the code:

1) Run a broker (with debug instrumentation or under a profiler).
2) Use the attached script to configure the broker with a large number of queues. You must pass the address of your broker to this script - even if it is the default address - or the script will start a new broker!
   qscale_test.py --broker 127.0.0.1 --count 30000
2b) Wait for the setup to complete - you will see the text "Press <ENTER> to continue..." - DON'T PRESS ENTER YET.
3) In another terminal, run a queue query against the broker:
   qpid-stat -q
4) Repeat step 3 as needed. Press ENTER to tear the setup down when done.

Some real-world numbers: I used the above script to configure 30000 queues+bindings on a local broker. I ran the following qpid-tools against the data set, forcing the tools to use the old-style framing interface (by forcing self.brokerSupportsV2 = False in console.py):

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-stat -q localhost:42447 > /dev/null
real    0m30.255s
user    0m26.741s
sys     0m2.236s

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-config -a localhost:42447 queues > /dev/null
real    0m26.326s
user    0m23.110s
sys     0m2.393s

Then I set up the same configuration, but this time using the QMF V2/variant message API:

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-stat -q localhost:39144 > /dev/null
real    1m11.040s
user    0m56.001s
sys     0m3.753s

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-config -a localhost:39144 queues > /dev/null
real    1m4.670s
user    0m52.134s
sys     0m3.719s

Created attachment 422271 [details]
Suggested patch

This reduces some copying of Variants and Variant::Maps and (more significantly in terms of bug 581006) batches the objects for a class-based query. A small change to Variant reduces some of the load and has been applied to the release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=2e2fa6ccfda4ba82811fae63fa994ed99468245b).

What's the status here, please? Should I expect the fix in 1.3?

Hi Jan,
Sorry, I haven't merged in any of Gordon's improvements yet. I've got a pending fix for https://bugzilla.redhat.com/show_bug.cgi?id=600488 that modifies the object query path. I want to merge Gordon's updates after the changes made for 600488. This BZ is next on my queue.

I have incorporated Gordon's suggestion to use batching of objects in query replies. Changes submitted to trunk:

Upstream bug info: https://issues.apache.org/jira/browse/QPID-2754
Committed revision 966795.

Will merge this into mrg_1.3.x once the changes are reviewed - see https://bugzilla.redhat.com/show_bug.cgi?id=616489

What's the status here?

I'd recommend testing the next broker release from mrg_1.3.x - qpid-cpp-server-0.7.946106-11 - it will contain the bugfix for 616489, plus Gordon's improvements.

For the first minute qpidd uses 100% of the CPU time; for the rest of the run the same amount is consumed by qpid-stat, until it ends with the following exception:

$ qpid-stat -q localhost:36780
Failed: Exception - Timed out after 60 seconds
...

python-qpid-0.7.946106-14.el5
python-qmf-0.7.946106-13.el5
qpid-cpp-server-0.7.946106-16.el5
qpid-tools-0.7.946106-11.el5

I think we need a more precise statement of the target for verification. The time to execute the query will depend on the number of objects (I assume this latest test was with 30000 queues?) and also on the hardware used in testing.

Yes, 30000 queues.
I tested on my laptop (which outperformed the AMD-based rack-mounted servers by almost a factor of two the last time I was playing with this): Lenovo X61, Intel(R) Core(TM)2 Duo CPU T7300, 2GB RAM.

Jan,
Is there any other activity on the broker at the time? The behaviour you are seeing is not consistent with my latest tests against 1.3. On my laptop - totally idle otherwise - I am running both the broker and qpid-stat/qpid-config locally. I configured 30K queues. I see the following results with the latest 1.3 tree:

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-stat -q | wc
  30008  330069 3660863
real    0m42.454s
user    0m32.369s
sys     0m1.303s

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-config queues | wc
  30007   90022 1680459
real    0m39.983s
user    0m28.399s
sys     0m1.414s

Compare these results with the results of the same test given in comment #10 above (where I'm using the V2 variant API), which was taken prior to the latest performance improvements.

Perhaps a clue: when I run the test, "top" shows that the broker process's CPU utilization runs very high (98%+), but for only about 10 seconds per run - which sounds very different from what you are experiencing, where the broker stays CPU bound for longer. Up to what number of queues can you go without the command timing out? How long does the broker process remain CPU bound during those commands?

Sorry for the late reply. Just for the sake of testing and statistics, I have measured data for:

python-qpid-0.5.758389-1.el5
qpidd-0.5.752581-5.el5

$ time qpid-stat -q localhost:60953 | wc
  30005  330036 3090421
real    0m39.903s
user    0m37.517s
sys     0m2.738s

$ time qpid-config -a localhost:60953 queues | wc
  30004   90010 1080151
real    0m35.224s
user    0m32.875s
sys     0m2.889s

Now I will update to the latest packages and retest.

python-qpid-0.7.946106-14.el5
python-qmf-0.7.946106-13.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5

$ time qpid-stat -q $broker | wc
  30008  330069 3240765
real    1m5.249s
user    0m50.456s
sys     0m0.541s

$ time qpid-config -a $broker queues | wc
  30007   90022 1260360
real    1m1.174s
user    0m46.172s
sys     0m0.527s

I will execute the test on RHEL4 and on both archs as soon as all the Errata bugs are verified. Sorry for the confusion.

This is a candidate for an automated test. I am adding the TestCaseNeeded keyword.

Closing now, as it was already released.
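For readers following the fix referenced above (revision 966795): the batching change groups the objects matched by a class-based query into a small number of reply messages, rather than sending one message per object or one enormous message for all of them. Below is a minimal sketch of the idea only, written in Python for brevity; the actual change is in the broker's C++ ManagementAgent code, and the batch size, function names, and send_reply callback here are assumptions for illustration.

# Illustrative sketch of batching query results; not the broker's actual code.
# Assumed: 'objects' is a list of already-encoded object maps, and 'send_reply'
# transmits one QMF reply message carrying a list of objects.

BATCH_SIZE = 100  # assumed value; the real batch size is a broker implementation detail

def send_query_response(objects, send_reply):
    """Send matched objects in fixed-size batches, marking every reply but the
    last as 'partial' so the console knows more data is coming."""
    if not objects:
        # An empty result still needs a final (non-partial) reply.
        send_reply([], partial=False)
        return
    for start in range(0, len(objects), BATCH_SIZE):
        batch = objects[start:start + BATCH_SIZE]
        is_last = (start + BATCH_SIZE) >= len(objects)
        send_reply(batch, partial=not is_last)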