Red Hat Bugzilla – Bug 625607
condor_configd (incorrectly) concludes there is no startd running when startd(s) given nonstandard names
Last modified: 2011-12-08 14:59:37 EST
Description of problem:
In case of configurations where startds are run under nonstandard daemon-list names (i.e., not STARTD), condor_configd will incorrectly conclude no startd is running, as it searches for "STARTD" in DAEMON_LIST.
For example, suppose a machine is configured with:
DAEMON_LIST = MASTER,STARTD_ST1,STARTD_ST2,STARTD_ST3 ...
If I attempt to activate a configuration, the configd log shows output like this:
08/19 17:58:30 INFO: Retrieving configuration version "1282255107129214" from the store
08/19 17:58:42 DEBUG: Retrieved configuration from the store
08/19 17:58:42 DEBUG: Daemons to restart: [u'startd']
08/19 17:58:42 DEBUG: Daemons to reconfig: 
08/19 17:58:42 DEBUG: Not sending "condor_restart" to subsystem "startd" since it is not currently running
Steps to Reproduce:
1. configure a condor node where there is a startd running, but named something nonstandard in DAEMON_LIST (e.g. STARTD_ST1, or some such)
2. make a modification to a parameter requiring a restart (or reconfig?) for that condor node
3. activate the configuration (condor_configure_pool --activate), while watching the log output on ConfigLog for the condor node
condor_configd will claim there is no startd running, and not restart the startds (see above).
condor_configd is supposed to restart any startds that are running.
see line 428 of condor_configd:
(retval, daemons, err) = run_cmd('condor_config_val -master DAEMON_LIST')
See also the method act_upon_subsys_list().
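The actual comparison code in condor_configd is not quoted here, so the following is only a minimal sketch of how an exact-name check against DAEMON_LIST would produce the symptom described above. The helper names parse_daemon_list and is_running_exact are hypothetical and do not come from the script.

```python
def parse_daemon_list(raw):
    """Split a DAEMON_LIST value into upper-cased daemon names.

    Accepts comma- and/or whitespace-separated names.
    """
    return [name.upper() for name in raw.replace(",", " ").split()]


def is_running_exact(subsys, daemon_list):
    """Hypothetical exact-name check (assumed, not the script's real code).

    A subsystem counts as running only if its literal name is listed.
    """
    return subsys.upper() in daemon_list


daemons = parse_daemon_list("MASTER,STARTD_ST1,STARTD_ST2,STARTD_ST3")
print(is_running_exact("master", daemons))  # MASTER is listed literally
print(is_running_exact("startd", daemons))  # STARTD is not, so the startds are missed
```

Under this assumption, "startd" is reported as not running even though three startds appear in the list under other names.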
Proposal for fix:
Define a config variable of the form:
<subsys>_WALLABY_EQUIV = <equiv1>, <equiv2> ...
For example:
STARTD_WALLABY_EQUIV = STARTD_ST1, STARTD_ST2, ... STARTD_ST90
Update the configd script to check for these variables -- if one is defined, then replace <subsys> with <equiv1>, <equiv2> ... as the parameter to condor_restart (or condor_reconfig).
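A rough sketch of the proposed lookup, under stated assumptions: the function expand_subsys is hypothetical, and the config dict stands in for whatever mechanism configd would use to read the <subsys>_WALLABY_EQUIV value. It is not taken from the existing script.

```python
def expand_subsys(subsys, config):
    """Return the daemon names to act on for one subsystem.

    config is a plain dict of parameter name -> raw string value
    (an assumed stand-in for real configuration lookups).
    If <SUBSYS>_WALLABY_EQUIV is defined and non-empty, its entries
    replace the subsystem name; otherwise the name is used unchanged.
    """
    key = "%s_WALLABY_EQUIV" % subsys.upper()
    raw = config.get(key, "")
    equivs = [name.strip() for name in raw.split(",") if name.strip()]
    return equivs if equivs else [subsys]


config = {"STARTD_WALLABY_EQUIV": "STARTD_ST1, STARTD_ST2, STARTD_ST3"}
for daemon in expand_subsys("startd", config):
    print(daemon)  # each name would be passed to condor_restart in turn
```

A subsystem with no equivalence variable falls through unchanged, so nodes with standard daemon names would behave as before.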
Another idea: for params that are of the form X.Y, assume that X is a subsystem and that X.Y has the same restart/reconfigure behavior as (unqualified) Y. Then derive subsystems for qualified parameters implicitly.
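This second idea could be sketched roughly as follows. The function names split_qualified and implied_subsystems are hypothetical, and the parameter names in the example are placeholders chosen for illustration only.

```python
def split_qualified(param):
    """Split 'X.Y' into (X, Y); return (None, param) if unqualified."""
    prefix, sep, rest = param.partition(".")
    if sep and prefix and rest:
        return prefix, rest
    return None, param


def implied_subsystems(params):
    """Collect the subsystem prefixes implied by qualified parameters."""
    prefixes = {split_qualified(p)[0] for p in params}
    prefixes.discard(None)  # drop the marker for unqualified parameters
    return sorted(prefixes)


changed = ["STARTD_ST1.PARAM_A", "STARTD_ST2.PARAM_B", "PARAM_C"]
print(implied_subsystems(changed))
```

The unqualified half returned by split_qualified would then be used to look up the restart/reconfigure behavior, as the proposal suggests.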
A subsystem in the wallaby store corresponds to a condor daemon to be restarted. In the case where there are multiple similar daemons, like multiple startds, a subsystem in the store should correspond to a subsystem/daemon condor will be running and monitoring. To copy a startd subsystem to a new subsystem called startd_1:
condor_configure_store -a -s startd_1
condor_configure_store -e -s startd,startd_1
Then copy all entries from startd to startd_1.