Bug 598550

Summary: Broker experiences high cpu loads when servicing QMF V2 format requests.
Product: Red Hat Enterprise MRG Reporter: Ken Giusti <kgiusti>
Component: qpid-qmf Assignee: Ken Giusti <kgiusti>
Status: CLOSED CURRENTRELEASE QA Contact: Jan Sarenik <jsarenik>
Severity: high Docs Contact:
Priority: low    
Version: beta CC: gsim, jsarenik
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-22 09:52:06 UTC Type: ---
Regression: --- Mount Type: ---
Bug Depends On: 616489    
Bug Blocks: 581006    
Attachments (description / flags):
  - raw callgrind output from run with QMF Variant-based messaging enabled. (flags: none)
  - raw callgrind output from run with legacy QMF fieldtable-based messaging enabled. (flags: none)
  - callgrind_annotate --threshold=80 ~/callgrind.out.19094.V2OFF > ~/V2OFF_topcalls.txt (flags: none)
  - callgrind_annotate --threshold=80 ~/callgrind.out.19094.V2ON > ~/V2ON_topcalls.txt (flags: none)
  - python script to configure a large number of queues and bindings. (flags: none)
  - Suggested patch (flags: none)

Description Ken Giusti 2010-06-01 15:37:52 UTC
Description of problem:

See BZ:  https://bugzilla.redhat.com/show_bug.cgi?id=581006

While running the scale tests from the above BZ, we see very high CPU utilization by qpidd.  If we repeat the test but disable V2 format messaging in the console.py QMF client, qpidd uses much less CPU and can service requests for a greater number of objects.


Version-Release number of selected component (if applicable):


How reproducible:

100%

Steps to Reproduce:
1.  See the attached BZ for test setup
2.  Monitor CPU use by qpidd during the test.
3.  Disable V2 QMF messaging by setting the brokerSupportsV2 flag to false in console.py:
4.  Repeat test

===================================================================
--- extras/qmf/src/py/qmf/console.py	(revision 949180)
+++ extras/qmf/src/py/qmf/console.py	(working copy)
@@ -2124,6 +2124,7 @@
       direct_result = self.amqpSession.exchange_query("qmf.default.direct")
       topic_result = self.amqpSession.exchange_query("qmf.default.topic")
       self.brokerSupportsV2 = not (direct_result.not_found or topic_result.not_found)
+      self.brokerSupportsV2 = False
 
       try:
         self.cv.acquire()
  

Actual results:

With V2 messaging disabled: lower CPU utilization, and a greater number of queues can be queried before a command timeout is hit.

Expected results:


Additional info:

Comment 1 Ken Giusti 2010-06-02 12:59:24 UTC
Created attachment 419017 [details]
raw callgrind output from run with QMF Variant-based messaging enabled.

Comment 2 Ken Giusti 2010-06-02 13:00:16 UTC
Created attachment 419018 [details]
raw callgrind output from run with legacy QMF fieldtable-based messaging enabled.

Comment 3 Ken Giusti 2010-06-02 13:04:02 UTC
Created attachment 419022 [details]
callgrind_annotate --threshold=80 ~/callgrind.out.19094.V2OFF > ~/V2OFF_topcalls.txt

Post-processed callgrind trace showing most-often-called functions for legacy QMF field-table based messages.

Comment 4 Ken Giusti 2010-06-02 13:08:30 UTC
Created attachment 419027 [details]
callgrind_annotate --threshold=80 ~/callgrind.out.19094.V2ON > ~/V2ON_topcalls.txt

Post-processed callgrind trace of top calls for a QMF getObjects request using the new Variant-style message formats.

Comment 5 Ken Giusti 2010-06-02 13:31:27 UTC
I've wrapped the handleGetQueryLH() routine in ManagementAgent.cpp with clock_gettime() calls to determine the amount of time being spent by the broker building and sending replies to a query for 30K queues.

I ran the query with the QMF V2 Variant-style message format turned on and off:

30K queues, V2 Variant messages:
2010-06-02 09:21:09 error SEC=6 NSEC=317236582

30K queues, legacy fieldtable messages:
2010-06-02 09:24:55 error SEC=0 NSEC=333698826

The same data set and search algorithms are used in both cases - the only difference between the two runs is the manner in which the reply messages are built.

A diff of the timing code:


[kgiusti@localhost 581006]$ svn diff ~/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp 
Index: /home/kgiusti/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp
===================================================================
--- /home/kgiusti/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp	(revision 950130)
+++ /home/kgiusti/Desktop/work/bz/581006/qpid/cpp/src/qpid/management/ManagementAgent.cpp	(working copy)
@@ -1480,6 +1492,9 @@
     FieldTable           ft;
     FieldTable::ValuePtr value;
 
+    struct timespec _s, _e;
+    int v1 = clock_gettime(CLOCK_MONOTONIC, &_s);
+
     moveNewObjectsLH();
 
     ft.decode(inBuffer);
@@ -1550,11 +1565,28 @@
     }
 
     sendCommandCompleteLH(replyToKey, sequence);
+
+
+    int v2 = clock_gettime(CLOCK_MONOTONIC, &_e);
+    if (!v1 && !v2) {
+        if (_s.tv_sec == _e.tv_sec) {
+            QPID_LOG(error, "SEC=0 NSEC=" << (_e.tv_nsec - _s.tv_nsec));
+        } else if (_s.tv_sec < _e.tv_sec) {
+            QPID_LOG(error, "SEC=" << (_e.tv_sec - _s.tv_sec) << " NSEC=" << (_e.tv_nsec + 1000000000) - _s.tv_nsec);
+        } else {
+            QPID_LOG(error, "WRAP SEC=" << _s.tv_sec <<  " SEC=" << _e.tv_sec);
+        }
+    }
+
 }
 
 
 void ManagementAgent::handleGetQueryLH(const string& body, const string& replyTo, const string& cid, bool viaLocal)
 {
+    struct timespec _s, _e;
+    int v1 = clock_gettime(CLOCK_MONOTONIC, &_s);
+
+
     moveNewObjectsLH();
 
     Variant::Map inMap;
@@ -1690,6 +1722,20 @@
     ListCodec::encode(list_, content);
     sendBufferLH(content, cid, headers, "amqp/list", v2Direct, replyTo);
     QPID_LOG(trace, "SENT QueryResponse (empty with no 'partial' indicator) to=" << replyTo);
+
+
+    int v2 = clock_gettime(CLOCK_MONOTONIC, &_e);
+    if (!v1 && !v2) {
+        if (_s.tv_sec == _e.tv_sec) {
+            QPID_LOG(error, "SEC=0 NSEC=" << (_e.tv_nsec - _s.tv_nsec));
+        } else if (_s.tv_sec < _e.tv_sec) {
+            QPID_LOG(error, "SEC=" << (_e.tv_sec - _s.tv_sec) << " NSEC=" << (_e.tv_nsec + 1000000000) - _s.tv_nsec);
+        } else {
+            QPID_LOG(error, "WRAP SEC=" << _s.tv_sec <<  " SEC=" << _e.tv_sec);
+        }
+    }
+
+
 }
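
For reference, a minimal standalone sketch (not the instrumentation in the diff above) of computing the elapsed time between two CLOCK_MONOTONIC samples; the signed 64-bit arithmetic absorbs any nanosecond borrow across a seconds boundary, so no separate wrap branch is needed:

    // May require linking with -lrt on older glibc.
    #include <time.h>
    #include <stdint.h>
    #include <stdio.h>

    // Elapsed nanoseconds between two CLOCK_MONOTONIC samples.
    static int64_t elapsed_ns(const struct timespec& start, const struct timespec& end)
    {
        return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
             + ((int64_t)end.tv_nsec - (int64_t)start.tv_nsec);
    }

    int main()
    {
        struct timespec s, e;
        clock_gettime(CLOCK_MONOTONIC, &s);
        /* ... code under measurement ... */
        clock_gettime(CLOCK_MONOTONIC, &e);
        printf("elapsed: %lld nsec\n", (long long)elapsed_ns(s, e));
        return 0;
    }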

Comment 6 Ken Giusti 2010-06-02 15:29:21 UTC
Additional timing data.

For a variant-based query that returns ~30,000 objects, the time summed across this code:


                    v1 = clock_gettime(CLOCK_MONOTONIC, &_s);

                    object->writeTimestamps(map_);
                    object->mapEncodeValues(values, true, true); // write both stats and properties
                    iter->first.mapEncode(oidMap);

                    map_["_values"] = values;
                    map_["_object_id"] = oidMap;
                    map_["_schema_id"] = mapEncodeSchemaId(object->getPackageName(),
                                                           object->getClassName(),
                                                           "_data",
                                                           object->getMd5Sum());
                    list_.push_back(map_);


                    ListCodec::encode(list_, content);

                    v2 = clock_gettime(CLOCK_MONOTONIC, &_e);


comes to 8.1-8.3 seconds across several runs of the same data.

Timing just the ListCodec::encode() call results in 2.6-3.7 seconds total.

Comment 8 Ken Giusti 2010-06-02 20:40:50 UTC
Created attachment 419159 [details]
python script to configure a large number of queues and bindings.

Comment 9 Ken Giusti 2010-06-02 20:47:04 UTC
How to run an example that generates a large amount of QMF traffic.  This can be used with a profiler to gauge the performance of the code.

1) Run a broker (with debug instrumentation or under a profiler).
2)  Use the attached script to configure the broker with a large number of queues; a minimal sketch of this kind of setup appears after these steps.  You must pass the address of your broker to this script - even if it is the default address - or the script will start a new broker!

qscale_test.py --broker 127.0.0.1 --count 30000

2b) wait for the setup to complete - you will see the text "Press <ENTER> to continue..." - DON'T PRESS ENTER YET.

3) In another terminal, run a queue query against the broker: qpid-stat -q

4) Repeat step 3 as needed.  Press ENTER to tear down the setup when done.
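
For reference, a minimal C++ sketch of the kind of setup the attached Python script performs (declare many queues and bind each one).  The queue names, binding keys, and hard-coded port are illustrative assumptions, and the classic qpid::client API is used here rather than the script's actual code:

    #include <qpid/client/Connection.h>
    #include <qpid/client/Session.h>
    #include <cstdlib>
    #include <sstream>

    using namespace qpid::client;

    int main(int argc, char** argv)
    {
        const char* host = argc > 1 ? argv[1] : "127.0.0.1";
        int count = argc > 2 ? atoi(argv[2]) : 30000;

        Connection connection;
        connection.open(host, 5672);
        Session session = connection.createSession();

        for (int i = 0; i < count; i++) {
            std::ostringstream name;
            name << "scale-test-q-" << i;   // hypothetical queue naming scheme
            session.queueDeclare(arg::queue=name.str());
            session.exchangeBind(arg::exchange="amq.direct",
                                 arg::queue=name.str(),
                                 arg::bindingKey=name.str());
        }

        connection.close();
        return 0;
    }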

Comment 10 Ken Giusti 2010-06-07 13:30:57 UTC
Some real-world numbers:

I used the above script to configure 30000 queues+bindings on a local broker.

I ran the following qpid-tools against the data set, forcing the tools to use the old-style framing interface (by forcing self.brokerSupportsV2 = False in console.py):

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-stat -q localhost:42447 > /dev/null

real	0m30.255s
user	0m26.741s
sys	0m2.236s
[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-config -a localhost:42447 queues > /dev/null

real	0m26.326s
user	0m23.110s
sys	0m2.393s


Then I set up the same configuration, but this time using the QMF V2/variant message API:

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-stat -q localhost:39144 > /dev/null

real	1m11.040s
user	0m56.001s
sys	0m3.753s

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-config -a localhost:39144 queues > /dev/null

real	1m4.670s
user	0m52.134s
sys	0m3.719s

Comment 11 Gordon Sim 2010-06-08 16:49:31 UTC
Created attachment 422271 [details]
Suggested patch

This reduces some copying of Variants and Variant::Maps and (more significantly in terms of bug 581006) batches the objects for a class-based query.
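
To illustrate the batching idea only (a sketch, not the attached patch; the batch size, the type names, and the send helper below are stand-ins): instead of sending the matched objects unbatched, a reply is flushed every N objects, with the final message carrying no 'partial' indicator.

    #include <list>
    #include <map>
    #include <string>
    #include <iostream>

    typedef std::map<std::string, std::string> ObjectMap;  // stand-in for Variant::Map
    typedef std::list<ObjectMap> ObjectList;                // stand-in for Variant::List

    // Stand-in for "encode the batch and send one reply message"
    // (the broker's real path uses ListCodec::encode() and sendBufferLH()).
    static void sendBatch(const ObjectList& batch, bool partial)
    {
        std::cout << "sent " << batch.size() << " objects"
                  << (partial ? " (partial)" : " (final)") << std::endl;
    }

    static void replyToQuery(const ObjectList& matches)
    {
        const size_t BATCH_SIZE = 100;      // assumed value, not the patch's
        ObjectList batch;
        for (ObjectList::const_iterator i = matches.begin(); i != matches.end(); ++i) {
            batch.push_back(*i);
            if (batch.size() >= BATCH_SIZE) {
                sendBatch(batch, true);     // partial reply; more to follow
                batch.clear();
            }
        }
        sendBatch(batch, false);            // final (possibly empty) reply completes the query
    }

    int main()
    {
        ObjectList matches(250);            // pretend 250 objects matched the query
        replyToQuery(matches);
        return 0;
    }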

Comment 12 Gordon Sim 2010-06-11 09:33:21 UTC
A small change to Variant reduces some of the load and has been applied to the release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=2e2fa6ccfda4ba82811fae63fa994ed99468245b).

Comment 15 Jan Sarenik 2010-07-20 14:05:42 UTC
What's the status here, please? Should I expect the fix in 1.3?

Comment 16 Ken Giusti 2010-07-20 14:14:33 UTC
Hi Jan,

Sorry, I haven't merged in any of Gordon's improvements yet.  I've got a pending fix for https://bugzilla.redhat.com/show_bug.cgi?id=600488 that modifies the object query path.  I want to merge Gordon's updates after the changes made for 600488.  This BZ is next in my queue.

Comment 17 Ken Giusti 2010-07-22 20:57:50 UTC
I have incorporated Gordon's suggestion to use batching of objects in query replies.

Changes submitted to trunk:

Upstream bug info: https://issues.apache.org/jira/browse/QPID-2754

Committed revision 966795.

Will merge this into mrg_1.3.x once changes are reviewed - see https://bugzilla.redhat.com/show_bug.cgi?id=616489

Comment 18 Jan Sarenik 2010-08-04 07:20:41 UTC
What's the status here?

Comment 19 Ken Giusti 2010-08-04 13:23:14 UTC
I'd recommend testing the next broker release from mrg_1.3.x - qpid-cpp-server-0.7.946106-11 - it will contain the bugfix for 616489, plus Gordon's improvements.

Comment 20 Jan Sarenik 2010-09-27 11:39:13 UTC
For the first minute, qpidd uses 100% of the CPU time; for the rest of the run
the same amount is taken by qpid-stat, until it ends with the following
exception:

$ qpid-stat -q localhost:36780
Failed: Exception - Timed out after 60 seconds
...

python-qpid-0.7.946106-14.el5
python-qmf-0.7.946106-13.el5
qpid-cpp-server-0.7.946106-16.el5
qpid-tools-0.7.946106-11.el5

Comment 21 Gordon Sim 2010-09-27 11:55:11 UTC
I think we need a more precise statement of the target for verification. The time to execute the query will depend on the number of objects (I assume this latest test was with 30000 queues?) and also the hardware used in testing.

Comment 22 Jan Sarenik 2010-09-27 13:03:25 UTC
Yes, 30000 queues.

I tested on my laptop (which outperformed the AMD-based racked
servers by almost a factor of two the last time I was playing with this):
Lenovo X61, Intel(R) Core(TM)2 Duo CPU T7300, 2GB RAM.

Comment 23 Ken Giusti 2010-09-27 16:18:52 UTC
Jan,

Is there any other activity on the broker at the time?   The behaviour you are seeing is not consistent with my latest tests against 1.3.

On my laptop - totally idle otherwise - I am running both the broker and qpid-stat/qpid-config locally.  I configured 30K queues.  I see the following results with the latest 1.3 tree:

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-stat -q | wc
  30008  330069 3660863

real	0m42.454s
user	0m32.369s
sys	0m1.303s

[kgiusti@localhost qpid]$ time ./tools/src/py/qpid-config queues | wc
  30007   90022 1680459

real	0m39.983s
user	0m28.399s
sys	0m1.414s

Compare these results with the results of the same test given in comment #10 above (where I used the V2 variant API), which predates the latest performance improvements.

Perhaps a clue: when I run the test, "top" shows that the broker process's CPU utilization runs very high (98%+), but only for about 10 seconds per run - which sounds very different from what you are experiencing, where the broker stays CPU bound much longer.

Up to how many queues can you go without hitting the command timeout?  How long does the broker process remain CPU bound during those commands?

Comment 24 Jan Sarenik 2010-10-04 06:15:28 UTC
Sorry for the late reply. Just for the sake of testing and statistics, I have
measured data for
  python-qpid-0.5.758389-1.el5
  qpidd-0.5.752581-5.el5

$ time qpid-stat -q localhost:60953 | wc
  30005  330036 3090421

real	0m39.903s
user	0m37.517s
sys	0m2.738s
$ time qpid-config -a localhost:60953 queues | wc
  30004   90010 1080151

real	0m35.224s
user	0m32.875s
sys	0m2.889s

Now I will update to latest packages and retest.

Comment 25 Jan Sarenik 2010-10-04 06:31:00 UTC
python-qpid-0.7.946106-14.el5
python-qmf-0.7.946106-13.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5

$ time qpid-stat -q $broker | wc
  30008  330069 3240765

real	1m5.249s
user	0m50.456s
sys	0m0.541s
$ time qpid-config -a $broker queues | wc
  30007   90022 1260360

real	1m1.174s
user	0m46.172s
sys	0m0.527s

Comment 26 Jan Sarenik 2010-10-04 06:33:22 UTC
I will execute the test on RHEL 4 and both architectures as soon as
all the Errata bugs are verified. Sorry for the confusion.

Comment 27 Jan Sarenik 2010-10-22 09:52:06 UTC
This is a candidate for an automated test. I am adding the TestCaseNeeded
keyword. Closing now, as the fix has already been released.