Description of problem:

Running 1 wallaby against 1000 configd processes (on a different physical machine), the machine hosting wallaby experienced very high disk utilization (98% reported by iostat) when condor_configure_pool was used to update the default group. The condor_configure_pool command eventually timed out. We then reran the test, this time moving the config, snapshot, and log files (the -d, -s, and -l wallaby-agent options) to a fast SSD drive. Using the SSD allowed the condor_configure_pool command to complete without timing out.

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Will suggests moving just the config file to the SSD and retesting:

(10:10:55 AM) willb: kgiusti: what if you just put the -d dir on the ssd?
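For reference, iostat's %util figure is derived from the per-device "time spent doing I/O" counter in /proc/diskstats. A minimal Linux-only sketch that dumps that raw counter for every block device (the formatting is illustrative, not part of the original test setup):

```shell
# Print the cumulative milliseconds each block device has spent doing I/O
# (field 13 of /proc/diskstats); iostat computes %util from the delta of
# this counter over its sampling interval.
awk '{printf "%-10s io_ms=%s\n", $3, $13}' /proc/diskstats
```

Sampling this twice a few seconds apart and dividing the delta by the interval length gives the same utilization figure iostat reports.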
Example - command used to cause the wallaby <-> configd io "storm":

[kgiusti@localhost qmf-scale]$ PYTHONPATH="/usr/lib/python2.4/site-packages" condor_configure_pool --default-group -a -f Master
Apply these changes [Y/n] ?
The following parameters need to be set for this configuration to be valid.
QMF_BROKER_HOST
CONDOR_HOST
Set these parameters now ? [y/N] y
QMF_BROKER_HOST: localhost
CONDOR_HOST: localhost
Configuration applied
Save this configuration [y/N] ?
Activate the changes [y/N] ? y
Configuration activated
Created attachment 413495 [details]
scripts used to setup wallaby-agent and configd scale tests

See the README file included in the tar.
0) Get the scripts from the attached tar - see the included README.txt

1) Run wallaby-setup.sh, using a local qpidd and the /tmp filesystem for config.db:

./wallaby-setup.sh --dir /tmp/wallaby

ps -fwwww
ruby /usr/bin/wallaby-agent -H localhost -p 5672 -d /tmp/wallaby/config.db -s /tmp/wallaby/snapshot.db -l /tmp/wallaby/logfile.txt

2) Create 1000 configd's on a different host:

CONDOR_CONFIGD_CMD="/root/kgiusti/configuration-tools/condor_configd" ./qmf-scale/configd-setup.sh --broker pman08 --port 5672 --count 1000 --name pman04

3) time condor_configure_pool --default-group -a -f Master
....
Apply these changes [Y/n] ? Y
The following parameters need to be set for this configuration to be valid.
QMF_BROKER_HOST
CONDOR_HOST
Set these parameters now ? [y/N] y
QMF_BROKER_HOST: localhost
CONDOR_HOST: localhost
Configuration applied
Save this configuration [y/N] ? n
Activate the changes [y/N] ? Y
Traceback (most recent call last):
  File "/usr/sbin/condor_configure_pool", line 736, in ?
    sys.exit(main())
  File "/usr/sbin/condor_configure_pool", line 729, in main
    activate_configuration(config_store)
  File "/usr/sbin/condor_configure_pool", line 351, in activate_configuration
    result = store.activateConfiguration()
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 285, in <lambda>
    return lambda *args, **kwargs : self._invoke(name, args, kwargs)
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 419, in _invoke
    raise RuntimeError("Timed out waiting for method to respond")
RuntimeError: Timed out waiting for method to respond

real    1m29.660s
user    0m0.090s
sys     0m0.013s

Now rerun the test, but put the wallaby config db on the fast SSD drive:

./wallaby-setup.sh --cdb /ssd/tmp/wallaby/myconfig.db

ps -wf
ruby /usr/bin/wallaby-agent -H localhost -p 5672 -d /ssd/tmp/wallaby/myconfig.db -s /tmp/snapshot.db -l /tmp/logfile.txt

<snip>

time condor_configure_pool --default-group -a -f Master
Apply these changes [Y/n] ?
Y
The following parameters need to be set for this configuration to be valid.
QMF_BROKER_HOST
CONDOR_HOST
Set these parameters now ? [y/N] Y
QMF_BROKER_HOST: localhost
CONDOR_HOST: localhost
Configuration applied
Save this configuration [y/N] ?
Activate the changes [y/N] ? Y
Configuration activated

real    0m53.190s
user    0m0.089s
sys     0m0.018s
I believe commit 3db09f0a (included in wallaby-0.3.5-1 and beyond) fixes this problem.
The condor_configd from condor-wallaby-client beyond 0.3.5-1 doesn't support parameters on the command line, like --hostname, --logfile, etc. I tried condor-wallaby-client-2.5-0.1 to reproduce it. Do you have any other idea how to run 1000 configd processes on the same machine and give them different parameters?
It should:

$ rpm -qf $(which condor_configd)
condor-wallaby-client-3.6-6.el5
$ grep -- --hostname $(which condor_configd)
if option in ('-h', '--hostname'):
$ grep -- --log $(which condor_configd)
if option in ('-l', '--logfile'):
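Given those options, a hypothetical sketch of starting many configd instances with distinct parameters. The loop only builds and echoes the command lines rather than executing them (the hostnames and log paths are made-up placeholders; a real run would background each daemon with `&` and use the full instance count):

```shell
# Build the per-instance command lines, one configd per iteration, each
# with its own -h/-l arguments.  Echoed here rather than executed, since
# condor_configd and a broker are only assumed on the test machine.
cmds=""
for i in 1 2 3; do
  cmds="${cmds}condor_configd -h fakenode${i} -l /tmp/configd-${i}.log
"
done
printf '%s' "$cmds"
```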
Sorry, I meant condor-wallaby-client versions before 0.3.5-1.
The issue should be with wallaby. The version of condor-wallaby-client should only matter if there is an incompatibility between condor-wallaby-client 3.6-6 and wallaby pre 0.3.5-1.
There almost surely is an incompatibility there; wallaby < 0.3.5 is ancient (it even predates the "com.redhat.grid" namespace IIRC, and there have been nontrivial API changes since). Rob would know which version of the configd was first to support the features that Tomas needs for testing; hopefully there is such a version that is compatible with such an old wallaby.
Created attachment 462057 [details]
scripts used to setup wallaby-agent and configd scale tests for 1.3 wallaby
Retested over current wallaby: wallaby-0.9.18-2.el5

I have no SSD; the fast drive was simulated with tmpfs:

tmpfs on /tmp type tmpfs (rw,size=512m)

[root@localhost ~]# ls -la /tmp
total 836
drwxrwxrwt  2 root    root      100 Nov 22 10:43 .
drwxr-xr-x 24 root    root     4096 Nov 18 05:25 ..
-rw-r--r--  1 wallaby condor 221184 Nov 22 10:43 config.db
-rw-r--r--  1 wallaby condor   6781 Nov 22 10:41 logfile.txt
-rw-r--r--  1 wallaby condor 601088 Nov 22 10:43 snapshot.db

# time condor_configure_pool --default-group -a -f Master,NodeAccess
Apply these changes [Y/n] ? y
The following parameters need to be set for this configuration to be valid.
ALLOW_READ
ALLOW_WRITE
CONDOR_HOST
Set these parameters now ? [y/N] y
ALLOW_READ: *
ALLOW_WRITE: *
CONDOR_HOST: localhost
Configuration applied
Create a named snapshot of this configuration [y/N] ?
Activate the changes [y/N] ? y
Activating configuration. This may take a while, please be patient
Configuration activated
Configuration saved

real    0m37.102s
user    0m0.312s
sys     0m0.029s
[root@hp-sl2x160zg6-02 ~]#

No such regression found. Fixed in the stable MRG 1.3 release on RHEL5.

>>> VERIFIED
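For anyone repeating the retest, the tmpfs mount shown above can be created as follows (a config sketch; it requires root, the 512m size matches the mount line quoted above, and mounting over /tmp hides any existing contents until unmounted):

```shell
# RAM-backed filesystem standing in for a fast SSD drive.
mount -t tmpfs -o size=512m tmpfs /tmp
```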