Testing mostly on mrg31, which hosts the broker for the MRG grid pool plugins, cumin, etc., it has been observed that the local condor_configd consumes more CPU than expected. In fact, its CPU usage closely mirrors whatever qpidd is doing at the time, tracking it within approximately 5%: when the broker jumps to 30%, the configd follows suit at, say, 26%. This is believed to be because the QMF Session object in the configd receives events for all unsolicited QMF activity (e.g., agent re-connects) across the entire broker, not just from the agent it cares about. A fix has been proposed that restricts the configd's event intake to the wallaby-agent only, as follows:

  self.session = Session(self.console, manageConnections=False,
                         rcvObjects=True, rcvHeartbeats=False,
                         rcvEvents=True, userBindings=True)
  self.session.bindAgent("com.redhat.grid.config", "Store")
  self.session.addEventFilter(package='com.redhat.grid.config',
                              event='NodeUpdatedNotice')
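For context, here is a minimal self-contained sketch (not the actual configd source; the StoreConsole class, its event callback, and the broker URL are illustrative) showing how the proposed setup plugs into the Python 2 qmf.console API:

  from qmf.console import Console, Session

  class StoreConsole(Console):
      # With userBindings=True, only traffic matching the explicit
      # bindings/filters registered below should be delivered here,
      # rather than every event on the broker.
      def event(self, broker, event):
          print "event:", event

  console = StoreConsole()
  session = Session(console, manageConnections=False, rcvObjects=True,
                    rcvHeartbeats=False, rcvEvents=True, userBindings=True)
  session.bindAgent("com.redhat.grid.config", "Store")
  session.addEventFilter(package='com.redhat.grid.config',
                         event='NodeUpdatedNotice')
  broker = session.addBroker("amqp://localhost:5672")  # illustrative URL

The key change is userBindings=True: with the default of False, the session binds broadly to the QMF exchanges and the console sees every agent's traffic on the broker.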
The CPU load was observed on all systems running a configd.
Added the bindAgent call. Fixed in: condor-wallaby-3.5-1
I ran a test with agents reconnecting to the broker, and the configd still shows CPU load correlated with the broker load. Tested with (version):
  condor-wallaby-client-3.6-6.el5
  wallaby-utils-0.9.18-2.el5
  ruby-wallaby-0.9.18-2.el5
  condor-wallaby-base-db-1.4-5.el5
  python-wallabyclient-3.6-6.el5
  condor-wallaby-tools-3.6-6.el5
  wallaby-0.9.18-2.el5
Hi Lubos, I'm having difficulty reproducing the problem with my setup. Can you provide more detail on exactly how you cause the CPU spike to occur? Thanks, -K
Created attachment 454311 [details]
agent_ruby.rb

My reproduction scenario is a simple one: I set auth=no in qpidd.conf and run a sufficient number of slightly modified agent_ruby.rb instances using this script:

  NUM_OF_AGENTS=21
  I=0
  while true; do
    if [ "$(ps -eo comm | grep -i agent_ruby.rb | wc -l)" -lt "$NUM_OF_AGENTS" ]; then
      I=$(($I+1))
      ( ./agent_ruby.rb $I > /dev/null 2>&1 & sleep $((${RANDOM}%2)); kill $! ) &
    else
      sleep 1
    fi
  done

The agent_ruby.rb script used can be found in the attachment.
I can reproduce the problem with Lubos' attachment & script. The test creates and destroys agents rapidly, which causes the following message pattern to be sent to all consoles on each agent create/destroy:

  newPackage: org.apache.qpid.qmf399
  newClass: 1 org.apache.qpid.qmf399:child(4c09a917-7402-0000-5050-5050d0d2d2d2)
  newClass: 1 org.apache.qpid.qmf399:parent(68ce7740-acf0-e8ee-c460-a5ac54da7f74)
  newClass: 2 org.apache.qpid.qmf399:test_event(2686c8c0-f552-f78f-48a4-a0a0a0a0a0a0)
  newClass: 1 org.apache.qpid.qmf399:child(4c09a917-7402-0000-5050-5050d0d2d2d2)
  newClass: 1 org.apache.qpid.qmf399:parent(68ce7740-acf0-e8ee-c460-a5ac54da7f74)
  newClass: 2 org.apache.qpid.qmf399:test_event(2686c8c0-f552-f78f-48a4-a0a0a0a0a0a0)
  newAgent: Agent(v1) at bank 1.251 (agent_test_label399)
  objectProps: org.apache.qpid.broker:agent[0-21-1-0-1993] 0-21-1-0-1987
  delAgent: Agent(v1) at bank 1.251 (agent_test_label399)
  objectProps: org.apache.qpid.broker:agent[0-21-1-0-1993] 0-21-1-0-1987

These are QMF-related messages. The newPackage/newClass messages are generated because each test agent instantiates uniquely-named packages and classes, which forces a newPackage/newClass event on every console. The current QMF implementation uses V1-style schema messages, so there is no way (yet) to filter messages of these types; this is a known QMF behavior that is being addressed in V2. The newAgent/delAgent updates, however, appear to be a QMF bug: V1-style agents that are managed by the broker are not being filtered by the bindAgent() call. I will open a BZ against this.
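To quantify the flood, a small sketch along these lines (hedged: it mirrors the qmf.console callback names shown in the output above; the TrafficCounter class, the 60-second window, and the broker URL are illustrative) can tally the unsolicited V1 callbacks that still arrive while the agent-churn script runs, even with the bound session from the proposed fix:

  import time
  from qmf.console import Console, Session

  class TrafficCounter(Console):
      # Count each unsolicited V1 callback the broker pushes at us.
      def __init__(self):
          self.counts = {}
      def _bump(self, name):
          self.counts[name] = self.counts.get(name, 0) + 1
      def newPackage(self, name):
          self._bump("newPackage")
      def newClass(self, kind, classKey):
          self._bump("newClass")
      def newAgent(self, agent):
          self._bump("newAgent")
      def delAgent(self, agent):
          self._bump("delAgent")

  counter = TrafficCounter()
  # Same session settings as the proposed configd fix, so anything
  # counted here is traffic that slipped past the bindings.
  session = Session(counter, manageConnections=False, rcvObjects=True,
                    rcvHeartbeats=False, rcvEvents=True, userBindings=True)
  session.bindAgent("com.redhat.grid.config", "Store")
  broker = session.addBroker("amqp://localhost:5672")
  time.sleep(60)              # let the agent-churn script run
  print counter.counts        # nonzero newAgent/delAgent counts show the leak
  session.delBroker(broker)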
New issue is captured as bug 645015.
I was not able to reproduce the bug on the old version:

  qpid-cpp-client-0.7.946106-11.el5
  qpid-cpp-server-devel-0.7.946106-11.el5
  qpid-cpp-mrg-debuginfo-0.7.946106-16.el5
  qpid-cpp-server-0.7.946106-11.el5
  qpid-java-common-0.7.946106-7.el5
  qpid-cpp-client-devel-docs-0.7.946106-11.el5
  qpid-cpp-client-devel-0.7.946106-11.el5
  python-qpid-0.7.946106-11.el5
  qpid-cpp-server-store-0.7.946106-11.el5
  qpid-cpp-server-xml-0.7.946106-11.el5
  qpid-cpp-client-ssl-0.7.946106-11.el5
  qpid-cpp-server-cluster-0.7.946106-11.el5
  qpid-java-client-0.7.946106-7.el5
  qpid-cpp-server-ssl-0.7.946106-11.el5
  qpid-tools-0.7.946106-8.el5
  condor-wallaby-client-3.4-1.el5

I tried starting/stopping condor continuously with condor-qmf installed and configured, but it doesn't load the broker enough. I also tried starting/stopping multiple sesame processes (without any modification); this test created some load on the broker but none on condor_configd. Could you please provide me with a reproduction scenario? Thanks, Lubos
Successfully reproduced with multiple qmf-agents starting/stopping.

Reproduced on:
  condor-wallaby-client-3.4-1.el5
  qpid-cpp-client-ssl-0.7.946106-12.el5
  qpid-cpp-server-ssl-0.7.946106-12.el5
  qpid-cpp-server-store-0.7.946106-12.el5
  qpid-cpp-mrg-debuginfo-0.7.946106-16.el5
  qpid-cpp-server-0.7.946106-12.el5
  qpid-java-common-0.7.946106-7.el5
  qpid-cpp-server-xml-0.7.946106-12.el5
  qpid-cpp-client-devel-docs-0.7.946106-12.el5
  qpid-cpp-server-cluster-0.7.946106-12.el5
  qpid-cpp-server-devel-0.7.946106-12.el5
  qpid-cpp-client-devel-0.7.946106-12.el5
  python-qpid-0.7.946106-12.el5
  qpid-java-client-0.7.946106-7.el5
  qpid-tools-0.7.946106-8.el5
  qpid-cpp-client-0.7.946106-12.el5

Tested with (version):
  qpid-cpp-server-xml-0.7.946106-17.el5
  qpid-tools-0.7.946106-11.el5
  qpid-cpp-mrg-debuginfo-0.7.946106-16.el5
  qpid-cpp-server-0.7.946106-17.el5
  qpid-cpp-client-rdma-0.7.946106-17.el5
  qpid-cpp-server-ssl-0.7.946106-17.el5
  qpid-cpp-server-store-0.7.946106-17.el5
  qpid-java-client-0.7.946106-11.el5
  qpid-cpp-client-0.7.946106-17.el5
  qpid-cpp-client-devel-0.7.946106-17.el5
  qpid-cpp-server-cluster-0.7.946106-17.el5
  qpid-java-common-0.7.946106-11.el5
  qpid-java-example-0.7.946106-11.el5
  qpid-tests-0.7.946106-1.el5
  qpid-cpp-server-devel-0.7.946106-17.el5
  rh-qpid-cpp-tests-0.7.946106-17.el5
  python-qpid-0.7.946106-14.el5
  qpid-cpp-client-ssl-0.7.946106-17.el5
  qpid-cpp-server-rdma-0.7.946106-17.el5
  ruby-qpid-0.7.946106-2.el5
  qpid-cpp-client-devel-docs-0.7.946106-17.el5
  condor-wallaby-client-3.6-6

Tested on:
  RHEL5 x86_64,i386 - passed
  RHEL4 x86_64,i386 (only configd) - passed

>>> VERIFIED