Description of problem:
The condor_configd fails to restart the Aviary and Schedd subsystems when they are configured under HA. As a result, the configuration is downloaded from the wallaby store but is not activated.

Version-Release number of selected component (if applicable):
python-wallabyclient-4.1.2-1.el6.noarch
condor-wallaby-client-4.1.2-1.el6.noarch
condor-wallaby-tools-4.1.2-1.el6.noarch
condor-wallaby-base-db-1.22-4.el6.noarch
ruby-wallaby-0.12.5-1.el6.noarch
wallaby-0.12.5-1.el6.noarch
wallaby-utils-0.12.5-1.el6.noarch
condor-classads-7.6.5-0.15.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.15.el6.x86_64
condor-7.6.5-0.15.el6.x86_64
condor-qmf-7.6.5-0.15.el6.x86_64
condor-aviary-7.6.5-0.15.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up a scheduler and JS/QS under HA
2. Set up the wallaby store for HA
3. Manually remove the configuration file /var/lib/condor/wallaby_node.config
4. Restart condor

Actual results:
For example, CONDOR_HOST was set in the store:
# condor_configure_pool -n node3 -l -v | grep -i condor_host
CONDOR_HOST = node1
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_ADMINISTRATOR = $(ALLOW_ADMINISTRATOR), $(CONDOR_HOST)

On node3, CONDOR_HOST is not part of the configuration from wallaby:
# grep CONDOR_HOST /var/lib/condor/wallaby_node.config
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)

The parameter is present in the local configuration only:
# condor_config_val -v CONDOR_HOST
CONDOR_HOST: node3
Defined in '/etc/condor/config.d/00personal_condor.config', line 3.

QMF_BROKER_HOST is set correctly in the local configuration:
# condor_config_val -v QMF_BROKER_HOST
QMF_BROKER_HOST: node1
Defined in '/etc/condor/config.d/50ha.config', line 1.
We can see this error in the ConfigLog:
06/26 09:59:42 INFO: Starting Up
06/26 09:59:42 INFO: Hostname is "node3"
06/26 09:59:42 INFO: Cleaning up temporary configuration files
06/26 09:59:57 INFO: Retrieving configuration version "1340609186881558" from the store
06/26 10:00:00 INFO: Retrieved configuration from the store
06/26 10:00:00 ERROR: Failed to send command "condor_restart" to subsystem "query_server" (retval: 1, stdout: "", stderr: "Can't find address for local query_server
Perhaps you need to query another pool.
")
06/26 10:00:00 ERROR: Failed to send command "condor_restart" to subsystem "schedd" (retval: 1, stdout: "", stderr: "Can't connect to local schedd
")
06/26 10:00:00 INFO: Exiting

No error output was found on node1 in /var/log/wallaby/agent.log.

Expected results:
The configuration from the store should be activated. The schedd and JS/QS services should be managed by HA; I think a restart through the configd should not be required here.
# rm /var/lib/condor/wallaby_node.config
rm: remove regular file `/var/lib/condor/wallaby_node.config'? y
# service condor restart
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
# tail -f /var/log/condor/ConfigLog
07/11 08:10:33 INFO: Cleaning up temporary configuration files
07/11 08:10:43 INFO: Retrieving configuration version "1340713488550492" from the store
07/11 08:10:46 INFO: Retrieved configuration from the store
07/11 08:10:47 ERROR: Failed to send command "condor_restart" to subsystem "query_server" (retval: 1, stdout: "", stderr: "Can't find address for local query_server
Perhaps you need to query another pool.
")
07/11 08:10:47 ERROR: Failed to send command "condor_restart" to subsystem "schedd" (retval: 1, stdout: "", stderr: "Can't connect to local schedd
")
07/11 08:10:47 INFO: Exiting
07/11 08:10:58 INFO: Starting Up
07/11 08:10:58 INFO: Hostname is "rhel-ha-3.hostname"
07/11 08:10:58 INFO: Cleaning up temporary configuration files

# rpm -qa wallaby
wallaby-0.12.5-10.el6.noarch

I still have the same problem. Could you please check whether the patch was included in the errata packages?
That error is expected from the configd. The configd uses the condor_* tools, which cannot send messages to any daemon running under RHHA. If you want to restart daemons running under RHHA, you need to do so through the RHHA tools. Is there a functional problem here?
It looks like a functional problem: if the restart fails, the new configuration is not activated on the node.
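The expected behavior can be sketched as follows: the configd's restart attempt against an RHHA-managed daemon is allowed to fail, but the failure is treated as non-fatal so the configuration retrieved from the store is still activated. This is an illustrative stand-in only, not the actual configd code; `restart_subsystem` here merely simulates the condor_restart failure seen in the ConfigLog.

```shell
#!/bin/sh
# Illustrative sketch: restart_subsystem stands in for the configd invoking
# condor_restart on a subsystem. Under RHHA this fails (e.g. "Can't connect
# to local schedd"), which we simulate with a non-zero exit status.
restart_subsystem() {
    echo "Can't connect to local $1" >&2
    return 1
}

# The failure is logged but treated as non-fatal, so the configuration
# downloaded from the wallaby store is still activated.
for subsys in query_server schedd; do
    if ! restart_subsystem "$subsys" 2>/dev/null; then
        echo "WARN: restart of $subsys failed; daemons under RHHA must be restarted via the cluster tools"
    fi
done
echo "configuration activated"
```

The point of the sketch is only the control flow: the restart error is reported, but it no longer aborts activation, which matches the behavior verified below with wallaby-0.12.5-10.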
# rm /var/lib/condor/wallaby_node.config
rm: remove regular file `/var/lib/condor/wallaby_node.config'? y
[root@rhel-ha-3 ~]# service condor restart
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
# tail -f /var/log/condor/ConfigLog
07/11 14:32:38 INFO: Starting Up
07/11 14:32:38 INFO: Hostname is "rhel-ha-3.hostname"
07/11 14:32:38 INFO: Cleaning up temporary configuration files
07/11 14:32:55 INFO: Retrieving configuration version "1342009897670908" from the store
07/11 14:32:59 INFO: Retrieved configuration from the store
07/11 14:33:00 INFO: Exiting
07/11 14:33:12 INFO: Starting Up
07/11 14:33:12 INFO: Hostname is "rhel-ha-3.hostname"
07/11 14:33:12 INFO: Cleaning up temporary configuration files

# condor_config_val -v CONDOR_HOST
CONDOR_HOST: rhel-ha-1.hostname
Defined in '/var/lib/condor/wallaby_node.config', line 102.
# condor_config_val -v QMF_BROKER_HOST
QMF_BROKER_HOST: rhel-ha-1.hostname
Defined in '/var/lib/condor/wallaby_node.config', line 26.

>>> VERIFIED