Bug 581006
| Summary: | qpid-config etc time out when run against a broker with 30000 queues or more | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Gordon Sim <gsim> |
| Component: | qpid-qmf | Assignee: | Ken Giusti <kgiusti> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jan Sarenik <jsarenik> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | Development | CC: | freznice, jsarenik, kgiusti, tross |
| Target Milestone: | 1.3 | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2010-10-22 09:53:04 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 598550 | ||
| Bug Blocks: | |||
| Attachments: | |||
|
Description
Gordon Sim
2010-04-09 18:45:07 UTC
Created attachment 406236 [details]
script to set up a large number of queues
I'm using this script to try to repro this problem.
I can't reproduce this on trunk either. After removing and re-installing the same packages as above, I could no longer reproduce the issue for low queue counts. I assume there must have been something messed up in my initial installation. However there is still an issue I think for much larger numbers of queues, e.g. 30000 and above. The --timeout option to qpid-config seems to have no effect on this (not necessarily unexpected given the description of that option in the usage statement), however the tool times out. This is slightly less disturbing now that I've cleaned up my environment, but it would be good to clarify what the issue is and whether we can fix it simply. fyi: on trunk I got above 30000, but timed out with double that. Each of these tools calls into qmf to pull all queue objects defined on the broker. QMF treats this as one transaction (get all objects of class "queue"). QMF applies its "response timer" against this single request. The default value for the response timer is 60 seconds. On my practically idle laptop, I can get up to about 60K queues before I expire the QMF response timer before all queue data has been received. I can think of two ways to address this problem - other opinions welcome: 1) refactor the QMF request/response implementation to apply the timeout on a per-message basis, rather than the whole transaction. As long as valid messages are arriving, we probably shouldn't be timing out. 2) Add a --request-timeout parameter to the qpid commands. Pass this parameter to the qmf getObjects() calls to override the 60 second qmf default. The --timeout option supported by the commands only applies to setting the connection up to the broker. The timescales for that would be shorter than the request time, IMHO, so I don't think using that parameter as the request timeout makes sense - I'd add a new option instead. Opinions? -K Created attachment 406364 [details]
implement timeout backoff in presence of messages from broker
Option 1 was trivial to implement. The result was less than expected:
1) qpid-config will complete processing a request, but it takes a looong time in the presence of a large config. Therefore, it gives the impression of a hang, which is only slightly less better than actually hanging.
2) even with the above patch, qpid-tool goes into a swap-fest (100,000 queue configuration), and grinds my laptop to a halt until I sigterm it. Unlike qpid-config, qpid-tool appears to have some sort of memory amplification issue with large configs.
In light of #2 above, I doubt my other suggestion (--request-timeout option) would be any more successful with qpid-tool.
I think there are two different problems here: 1) the "one-shot" management tools, like qpid-config and qpid-route, are simply timing out due to the amount of data that is being transferred with the broker. If I extend the timeout on the getObject()'s call, then these commands complete successfully. I'd recommend adding a --request-timeout optional parameter to these calls, which would allow the user to override the default qmf timeout if needed. 2) qpid-tool uses a huge amount of memory when invoked against a broker with a large number of queues. Unlike #1 above, qpid-tool becomes unusable - on my system qpid-tool's vm usage hit over 1 gig when invoked against a broker with 100,000 queues. I'd like to say problem #1 can be fixed with a configuration option, #2 is going to require tuning/debugging the memory usage of the qpid-tool app. An update to the behavior:
Package set (qpid-cpp-server-0.7.935473-1.el5, python-qpid-0.7.934605-1.el5, python-qmf-0.7.934605-1.el5, qpid-tools-0.7.934605-2.el5) shows that
qmf queries done 'qpid-stat -b' much more rapidly ends with:
[root@mrg-qe-02 ~]# qpid-stat -b
Failed: Exception - Timed out after 60 seconds
Traceback (most recent call last):
File "/usr/bin/qpid-stat", line 503, in ?
bm.display()
File "/usr/bin/qpid-stat", line 447, in display
self.brokers.append(Broker(self.qmf, b))
File "/usr/bin/qpid-stat", line 124, in __init__
list = qmf.getObjects(_class="queue", _package=package, _agent=self.brokerAgent)
File "/usr/lib/python2.4/site-packages/qmf/console.py", line 762, in getObjects
request.wait(timeout)
File "/usr/lib/python2.4/site-packages/qmf/console.py", line 1250, in wait
raise Exception("Timed out after %d seconds" % timeout)
Exception: Timed out after 60 seconds
qpid-stat -b fails very rapidly with timeout on above listed packages, other qpid-tools fails only in case of a lot of objects.
qpid-stat -b is observed to mostly fail BUT sometimes surprisingly passes, so I cannot say that qmf clients are not working anymore after first timeout.
In reply to comment #8 - qpid-stat uses a different timeout than other qpid tools. qpid-stat does not override the default session timeout (which is 10 seconds). The other tools will override the default timeout. That would explain the differences in behavior between qpid-stat and the other tools. Created attachment 413134 [details]
Updated queue scale test
Current HEAD crashes when attempting to repro the large scale config:
self.amqpSession.channel=None
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib64/python2.6/threading.py", line 522, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.6/threading.py", line 477, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/kgiusti/Desktop/work/bz/581006/qpid/python/qpid/queue.py", line 84, in run
self.listener(o)
File "/home/kgiusti/Desktop/work/bz/581006/qpid/extras/qmf/src/py/qmf/console.py", line 2313, in _v2Cb
chan.session_completed(self.amqpSession.receiver._completed)
AttributeError: 'NoneType' object has no attribute 'session_completed'
On RHEL5.5 x86_64, qpidd --auth=no --no-module-dir, broker under
very high load*. python-qpid-0.7.946106-1.el5, qpid-tools-0.7.946106-2.el5
* I think this _is_ "very high load":
$ uptime
14:54:53 up 2:50, 1 user, load average: 15.76, 14.63, 13.82
# qpid-config
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib64/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.4/site-packages/qpid/queue.py", line 84, in run
self.listener(o)
File "/usr/lib/python2.4/site-packages/qmf/console.py", line 2366, in _v2Cb
self.amqpSession.channel.session_completed(self.amqpSession.receiver._completed)
AttributeError: 'NoneType' object has no attribute 'session_completed'
Failed: Exception: Timed out after 60 seconds
And python-qmf-0.7.946106-3.el5 - the update does not help this bug. Created attachment 416563 [details]
script to set up a large number of queues.
All of the qpid-xxx tools have a hardcoded timeout of 60 seconds. This is the default transaction timeout assigned by the QMF layer. It _may_ be overridden by providing a "_timeout=<secs>" parameter to any call to qmf's getObjects() method, however, none of the qpid tools do this. Using the attached script, I can create approximately 30,000 queues+bindings on my idle T500 laptop (localhost broker) without a timeout. Increasing the # much beyond this will cause the script to hit the 60 second limit. A timeout exception is thrown, as such: Failed: Exception: Timed out after 60 seconds However, if I manually modify qpid-config or qpid-stat to pass _timeout=120 to the various calls to getObjects(), I can scale much further (50,000) without hitting a timeout. When I run this test, I see qpidd briefly peak its cpu util over 90%. This is caused by the broker gathering all the queue+binding info needed to satisfy the getObject() request. Once the broker starts sending this data over to the script, qpidd cpu utilization drops down to the 10%-20% range until all data is transfered. This behavior is expected. However, the real system load originates with the qpid python tool (qpid-config, or qpid-stat, etc). During the data transfer, the python tool runs at > 90% cpu utilization. This is because qpidd (C++) can send messages much faster than the python messaging + qmf can process them. This results in the python processing becoming cpu bound as it attempts to keep up with the incoming data. Again, this behavior is expected. Note that some command options greatly increase the amount of data that is transferred between qpidd and the qpid tool. For example "qpid-config exchanges" simply retrieves all known exchanges. However "qpid-config exchanges -b amq.direct" not only gets all exchanges, but all queues and all bindings. Thus simply adding the "-b amq.direct" filter greatly increases the amount of data (and time and cpu load) required by the "qpid-config" command. Lastly, the crash dump from comment 12 is due to a race condition that occurs only when a timeout is hit, as the tool is not shutting down cleanly. While this is a bug, it is encountered only during a failure scenario - I will open a new BZ to track this issue. I believe the software is operating within the limits of its design. Created attachment 416570 [details]
python script to configure a large number of queues and bindings.
Bug to track crash during timeout: https://bugzilla.redhat.com/show_bug.cgi?id=595914 I have read the explanation. Thank you very much for making it clear comprehensively. Does it mean this bug should be closed as NOTABUG? Or should I put it into VERIFIED after experiencing it is possible to run qpid-tool without timeout when 30000 queues are defined? Hi Jan, It would be a great sanity check if it can be verified that 30k queues can be queried without timeout using: qpid-config queues and qpid-stat -q against a local broker on a machine that is comparable (or better) to my T500: Fedora 11 x86_64, CPU dual core intel T9600 (2.8Ghz), 2GB 1066Mhz memory I'll open a feature request to enhance the qpid tools for better scaling capabilities. I'll detail what I think can be done in that new bug. It times out on ThinkPad X61 laptop, which is Intel(R) Core(TM)2 Duo CPU
T7300 @ 2.00GHz, 2GB SDRAM (155 ns):
### (this was run in the middle of queue creation)
$ qpid-config -a localhost:45886
Total Exchanges: 8
topic: 3
headers: 1
fanout: 1
direct: 3
Total Queues: 19000
durable: 0
non-durable: 19000
$ qpid-config -a localhost:45886
Failed: Exception: Timed out after 60 seconds
$ qpid-config -a localhost:45886 queues
Failed: Exception: Timed out after 60 seconds
$ qpid-stat -q localhost:45886
Failed: Exception - Timed out after 60 seconds
Traceback (most recent call last):
File "/usr/bin/qpid-stat", line 503, in ?
bm.display()
File "/usr/bin/qpid-stat", line 449, in display
self.brokers.append(Broker(self.qmf, self.broker))
File "/usr/bin/qpid-stat", line 124, in __init__
list = qmf.getObjects(_class="queue", _package=package, _agent=self.brokerAgent)
File "/usr/lib/python2.4/site-packages/qmf/console.py", line 800, in getObjects
request.wait(timeout)
File "/usr/lib/python2.4/site-packages/qmf/console.py", line 1402, in wait
raise Exception("Timed out after %d seconds" % timeout)
Exception: Timed out after 60 seconds
And even worse, this times out on a PowerEdge SC1435 server with AMD Quad-Core Opteron, 8GB 667MHz DDR2 RAM: ----------------------------- [root@mrg-qe-09 ~]# python bz581006.py Broker localhost:56606 running... creating 30000 queues... ............................................................................................................................................................................................................................................................................................................Create done. Press <ENTER> to continue... Querying... Failed: Exception: Timed out after 60 seconds real 1m6.075s user 0m53.705s sys 0m9.409s ----------------------------- Both machines are using RHN-updated RHEL5.5 x86_64 with packages qpid-cpp-server-0.7.946106-2.el5 qpid-tools-0.7.946106-4.el5 (I use the python script provided in Comment #16, witho only two changes, the qpidd_cmd path set to "/usr/sbin/qpidd" and similarly for qpid-config.) Worked with Jan yesterday to debug this further, and made the following conclusions: 1) The qpidd process is cpu bound when processing the QMF getObjects request. See https://bugzilla.redhat.com/show_bug.cgi?id=598550. 2) The AMD-based server has multiple cores, however, the per-core compute performance is _less_ than the per-core performance of my Intel Core2 T9600: AMD per core: 2.3Mhz/512K L2/Bogomips=4588 T9600: 2.8Mhz/6MB L2/Bogomips=5586 3) The max queues we can service on the AMD server is roughly 25K, which compares linearly with the relative performance of the processors: AMD is ~82% bogomips of Intel, 25K ~= 28% of 30K. The results Jan is seeing on his systems is expected given the relative capabilities of those systems. However, once BZ598550 is addressed, the maximum queue numbers should improve proportionally. |