Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 581006

Summary:

qpid-config etc time out when run against a broker with 30000 queues or more

Product:

Red Hat Enterprise MRG

Reporter:

Gordon Sim <gsim>

Component:

qpid-qmf

Assignee:

Ken Giusti <kgiusti>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Jan Sarenik <jsarenik>

Severity:

high

Docs Contact:

Priority:

high

Version:

Development

CC:

freznice, jsarenik, kgiusti, tross

Target Milestone:

1.3

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2010-10-22 09:53:04 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

598550

Bug Blocks:

Attachments:

Description	Flags
script to set up a large number of queues	none
implement timeout backoff in presence of messages from broker	none
Updated queue scale test	none
script to set up a large number of queues.	none
python script to configure a large number of queues and bindings.	none

Description Gordon Sim 2010-04-09 18:45:07 UTC

Description of problem:

If a broker has lots of queues in it, qpid-config times out when run against the broker. (likewise qpid-tool and qpid-stat, so I think this is an issue in the python QMF console apis perhaps). 

Version-Release number of selected component (if applicable):

python-qpid-0.7.930108-1.el5
python-qmf-0.7.930108-1.el5
qpid-cpp-server-0.7.929717-1.el5

(Haven't checked earlier rpms)

How reproducible:

100%

Steps to Reproduce:
1. create lots of queues (500 and above shows the problem for me)
2. run qpid-config
  
Actual results:

qpid-config times out

Expected results:

qpid-config should report number of queues and exchanges

Additional info:

Comment 1 Ken Giusti 2010-04-13 13:24:58 UTC

Created attachment 406236 [details]
script to set up a large number of queues

I'm using this script to try to repro this problem.

Comment 2 Gordon Sim 2010-04-13 14:47:46 UTC

I can't reproduce this on trunk either.

Comment 3 Gordon Sim 2010-04-13 17:00:22 UTC

After removing and re-installing the same packages as above, I could no longer reproduce the issue for low queue counts. I assume there must have been something messed up in my initial installation.

However there is still an issue I think for much larger numbers of queues, e.g. 30000 and above. The --timeout option to qpid-config seems to have no effect on this (not necessarily unexpected given the description of that option in the usage statement), however the tool times out. This is slightly less disturbing now that I've cleaned up my environment, but it would be good to clarify what the issue is and whether we can fix it simply.

Comment 4 Gordon Sim 2010-04-13 18:24:26 UTC

fyi: on trunk I got above 30000, but timed out with double that.

Comment 5 Ken Giusti 2010-04-13 20:30:40 UTC

Each of these tools calls into qmf to pull all queue objects defined on the broker.  QMF treats this as one transaction (get all objects of class "queue").  QMF applies its "response timer" against this single request.  The default value for the response timer is 60 seconds.

On my practically idle laptop, I can get up to about 60K queues before I expire the QMF response timer before all queue data has been received.

I can think of two ways to address this problem - other opinions welcome:

1) refactor the QMF request/response implementation to apply the timeout on a per-message basis, rather than the whole transaction.  As long as valid messages are arriving, we probably shouldn't be timing out.

2) Add a --request-timeout parameter to the qpid commands.  Pass this parameter to the qmf getObjects() calls to override the 60 second qmf default.

The --timeout option supported by the commands only applies to setting the connection up to the broker.  The timescales for that would be shorter than the request time, IMHO, so I don't think using that parameter as the request timeout makes sense - I'd add a new option instead.

Opinions?

-K

Comment 6 Ken Giusti 2010-04-13 22:28:31 UTC

Created attachment 406364 [details]
implement timeout backoff in presence of messages from broker

Option 1 was trivial to implement.  The result was less than expected:

1) qpid-config will complete processing a request, but it takes a looong time in the presence of a large config.  Therefore, it gives the impression of a hang, which is only slightly less better than actually hanging.

2) even with the above patch, qpid-tool goes into a swap-fest (100,000 queue configuration), and grinds my laptop to a halt until I sigterm it.  Unlike qpid-config, qpid-tool appears to have some sort of memory amplification issue with large configs.

In light of #2 above, I doubt my other suggestion (--request-timeout option) would be any more successful with qpid-tool.

Comment 7 Ken Giusti 2010-04-14 15:26:39 UTC

I think there are two different problems here:

1) the "one-shot" management tools, like qpid-config and qpid-route, are simply timing out due to the amount of data that is being transferred with the broker.  If I extend the timeout on the getObject()'s call, then these commands complete successfully.  I'd recommend adding a --request-timeout optional parameter to these calls, which would allow the user to override the default qmf timeout if needed.

2) qpid-tool uses a huge amount of memory when invoked against a broker with a large number of queues.   Unlike #1 above, qpid-tool becomes unusable - on my system qpid-tool's vm usage hit over 1 gig when invoked against a broker with 100,000 queues.  

I'd like to say problem #1 can be fixed with a configuration option, #2 is going to require tuning/debugging the memory usage of the qpid-tool app.

Comment 8 Frantisek Reznicek 2010-04-29 08:59:13 UTC

An update to the behavior:
Package set (qpid-cpp-server-0.7.935473-1.el5, python-qpid-0.7.934605-1.el5, python-qmf-0.7.934605-1.el5, qpid-tools-0.7.934605-2.el5) shows that
qmf queries done 'qpid-stat -b' much more rapidly ends with:
[root@mrg-qe-02 ~]# qpid-stat -b
Failed: Exception - Timed out after 60 seconds
Traceback (most recent call last):
  File "/usr/bin/qpid-stat", line 503, in ?
    bm.display()
  File "/usr/bin/qpid-stat", line 447, in display
    self.brokers.append(Broker(self.qmf, b))
  File "/usr/bin/qpid-stat", line 124, in __init__
    list = qmf.getObjects(_class="queue", _package=package, _agent=self.brokerAgent)
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 762, in getObjects
    request.wait(timeout)
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 1250, in wait
    raise Exception("Timed out after %d seconds" % timeout)
Exception: Timed out after 60 seconds

qpid-stat -b fails very rapidly with timeout on above listed packages, other qpid-tools fails only in case of a lot of objects.

qpid-stat -b is observed to mostly fail BUT sometimes surprisingly passes, so I cannot say that qmf clients are not working anymore after first timeout.

Comment 9 Ken Giusti 2010-05-11 12:55:46 UTC

In reply to comment #8 - qpid-stat uses a different timeout than other qpid tools.  qpid-stat does not override the default session timeout (which is 10 seconds).  The other tools will override the default timeout.  That would explain the differences in behavior between qpid-stat and the other tools.

Comment 10 Ken Giusti 2010-05-11 13:11:07 UTC

Created attachment 413134 [details]
Updated queue scale test

Comment 11 Ken Giusti 2010-05-11 13:14:27 UTC

Current HEAD crashes when attempting to repro the large scale config:


self.amqpSession.channel=None
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 522, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.6/threading.py", line 477, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/kgiusti/Desktop/work/bz/581006/qpid/python/qpid/queue.py", line 84, in run
    self.listener(o)
  File "/home/kgiusti/Desktop/work/bz/581006/qpid/extras/qmf/src/py/qmf/console.py", line 2313, in _v2Cb
    chan.session_completed(self.amqpSession.receiver._completed)
AttributeError: 'NoneType' object has no attribute 'session_completed'

Comment 12 Jan Sarenik 2010-05-21 12:58:01 UTC

On RHEL5.5 x86_64, qpidd --auth=no --no-module-dir, broker under
very high load*. python-qpid-0.7.946106-1.el5, qpid-tools-0.7.946106-2.el5

 * I think this _is_ "very high load":
   $ uptime
    14:54:53 up  2:50,  1 user,  load average: 15.76, 14.63, 13.82

# qpid-config
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.4/site-packages/qpid/queue.py", line 84, in run
    self.listener(o)
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 2366, in _v2Cb
    self.amqpSession.channel.session_completed(self.amqpSession.receiver._completed)
AttributeError: 'NoneType' object has no attribute 'session_completed'

Failed: Exception: Timed out after 60 seconds

Comment 13 Jan Sarenik 2010-05-21 13:24:41 UTC

And python-qmf-0.7.946106-3.el5 - the update does not
help this bug.

Comment 14 Ken Giusti 2010-05-25 21:26:50 UTC

Created attachment 416563 [details]
script to set up a large number of queues.

Comment 15 Ken Giusti 2010-05-25 21:56:29 UTC

All of the qpid-xxx tools have a hardcoded timeout of 60 seconds. This is the default transaction timeout assigned by the QMF layer. It _may_ be overridden by providing a "_timeout=<secs>" parameter to any call to qmf's getObjects() method, however, none of the qpid tools do this.

Using the attached script, I can create approximately 30,000 queues+bindings on my idle T500 laptop (localhost broker) without a timeout. Increasing the # much beyond this will cause the script to hit the 60 second limit. A timeout exception is thrown, as such:

Failed: Exception: Timed out after 60 seconds

However, if I manually modify qpid-config or qpid-stat to pass _timeout=120 to the various calls to getObjects(), I can scale much further (50,000) without hitting a timeout.

When I run this test, I see qpidd briefly peak its cpu util over 90%. This is caused by the broker gathering all the queue+binding info needed to satisfy the getObject() request. Once the broker starts sending this data over to the script, qpidd cpu utilization drops down to the 10%-20% range until all data is transfered. This behavior is expected.

However, the real system load originates with the qpid python tool (qpid-config, or qpid-stat, etc). During the data transfer, the python tool runs at > 90% cpu utilization. This is because qpidd (C++) can send messages much faster than the python messaging + qmf can process them. This results in the python processing becoming cpu bound as it attempts to keep up with the incoming data. Again, this behavior is expected.

Note that some command options greatly increase the amount of data that is transferred between qpidd and the qpid tool. For example "qpid-config exchanges" simply retrieves all known exchanges. However "qpid-config exchanges -b amq.direct" not only gets all exchanges, but all queues and all bindings. Thus simply adding the "-b amq.direct" filter greatly increases the amount of data (and time and cpu load) required by the "qpid-config" command.

Lastly, the crash dump from comment 12 is due to a race condition that occurs only when a timeout is hit, as the tool is not shutting down cleanly. While this is a bug, it is encountered only during a failure scenario - I will open a new BZ to track this issue.

I believe the software is operating within the limits of its design.

Comment 16 Ken Giusti 2010-05-25 21:57:21 UTC

Created attachment 416570 [details]
python script to configure a large number of queues and bindings.

Comment 17 Ken Giusti 2010-05-25 22:02:59 UTC

Bug to track crash during timeout:

https://bugzilla.redhat.com/show_bug.cgi?id=595914

Comment 18 Jan Sarenik 2010-05-26 11:59:43 UTC

I have read the explanation. Thank you very much for making
it clear comprehensively.

Does it mean this bug should be closed as NOTABUG?

Or should I put it into VERIFIED after experiencing it is
possible to run qpid-tool without timeout when 30000 queues
are defined?

Comment 19 Ken Giusti 2010-05-26 13:25:28 UTC

Hi Jan,

It would be a great sanity check if it can be verified that 30k queues can be queried without timeout using:


 qpid-config queues

and

 qpid-stat -q


against a local broker on a machine that is comparable (or better) to my T500:

Fedora 11 x86_64, CPU dual core intel T9600 (2.8Ghz), 2GB 1066Mhz memory

I'll open a feature request to enhance the qpid tools for better scaling capabilities.  I'll detail what I think can be done in that new bug.

Comment 20 Jan Sarenik 2010-05-31 06:55:58 UTC

It times out on ThinkPad X61 laptop, which is Intel(R) Core(TM)2 Duo CPU
T7300 @ 2.00GHz, 2GB SDRAM (155 ns):

### (this was run in the middle of queue creation)
$ qpid-config -a localhost:45886
Total Exchanges: 8
          topic: 3
        headers: 1
         fanout: 1
         direct: 3

   Total Queues: 19000
        durable: 0
    non-durable: 19000

$ qpid-config -a localhost:45886
Failed: Exception: Timed out after 60 seconds
$ qpid-config -a localhost:45886 queues
Failed: Exception: Timed out after 60 seconds
$ qpid-stat -q localhost:45886 
Failed: Exception - Timed out after 60 seconds
Traceback (most recent call last):
  File "/usr/bin/qpid-stat", line 503, in ?
    bm.display()
  File "/usr/bin/qpid-stat", line 449, in display
    self.brokers.append(Broker(self.qmf, self.broker))
  File "/usr/bin/qpid-stat", line 124, in __init__
    list = qmf.getObjects(_class="queue", _package=package, _agent=self.brokerAgent)
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 800, in getObjects
    request.wait(timeout)
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 1402, in wait
    raise Exception("Timed out after %d seconds" % timeout)
Exception: Timed out after 60 seconds

Comment 21 Jan Sarenik 2010-05-31 07:13:03 UTC

And even worse, this times out on a PowerEdge SC1435 server
with AMD Quad-Core Opteron, 8GB 667MHz DDR2 RAM:

-----------------------------
[root@mrg-qe-09 ~]# python bz581006.py 
Broker localhost:56606 running...
creating 30000 queues...
............................................................................................................................................................................................................................................................................................................Create done.
Press <ENTER> to continue...
Querying...
Failed: Exception: Timed out after 60 seconds

real	1m6.075s
user	0m53.705s
sys	0m9.409s
-----------------------------

Both machines are using RHN-updated RHEL5.5 x86_64 with packages
  qpid-cpp-server-0.7.946106-2.el5
  qpid-tools-0.7.946106-4.el5

(I use the python script provided in Comment #16, witho only two changes,
the qpidd_cmd path set to "/usr/sbin/qpidd" and similarly for qpid-config.)

Comment 22 Ken Giusti 2010-06-02 12:55:52 UTC

Worked with Jan yesterday to debug this further, and made the following conclusions:

1) The qpidd process is cpu bound when processing the QMF getObjects request.  See https://bugzilla.redhat.com/show_bug.cgi?id=598550.

2) The AMD-based server has multiple cores, however, the per-core compute performance is _less_ than the per-core performance of my Intel Core2 T9600:

AMD per core: 2.3Mhz/512K L2/Bogomips=4588
T9600: 2.8Mhz/6MB L2/Bogomips=5586

3) The max queues we can service on the AMD server is roughly 25K, which compares linearly with the relative performance of the processors: AMD is ~82% bogomips of Intel, 25K ~= 28% of 30K.

The results Jan is seeing on his systems is expected given the relative capabilities of those systems.  However, once BZ598550 is addressed, the maximum queue numbers should improve proportionally.