Description of problem:
The condor_configd fails to restart the Aviary and Schedd subsystems when they are configured under HA. As a result, the configuration is downloaded from the wallaby store but is not activated.

Version-Release number of selected component (if applicable):
python-wallabyclient-4.1.2-1.el6.noarch
condor-wallaby-client-4.1.2-1.el6.noarch
condor-wallaby-tools-4.1.2-1.el6.noarch
condor-wallaby-base-db-1.22-4.el6.noarch
ruby-wallaby-0.12.5-1.el6.noarch
wallaby-0.12.5-1.el6.noarch
wallaby-utils-0.12.5-1.el6.noarch
condor-classads-7.6.5-0.15.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.15.el6.x86_64
condor-7.6.5-0.15.el6.x86_64
condor-qmf-7.6.5-0.15.el6.x86_64
condor-aviary-7.6.5-0.15.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up a scheduler and JS/QS under HA
2. Set up the wallaby store for HA
3. Manually remove the configuration file /var/lib/condor/wallaby_node.config
4. Restart condor

Actual results:
For example, CONDOR_HOST was set in the store:
# condor_configure_pool -n node3 -l -v | grep -i condor_host
CONDOR_HOST = node1
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_ADMINISTRATOR = $(ALLOW_ADMINISTRATOR), $(CONDOR_HOST)

On node3, CONDOR_HOST is not part of the configuration from wallaby:
# grep CONDOR_HOST /var/lib/condor/wallaby_node.config
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)

The parameter is present in the local configuration only:
# condor_config_val -v CONDOR_HOST
CONDOR_HOST: node3
Defined in '/etc/condor/config.d/00personal_condor.config', line 3.

QMF_BROKER_HOST is set correctly in the local configuration:
# condor_config_val -v QMF_BROKER_HOST
QMF_BROKER_HOST: node1
Defined in '/etc/condor/config.d/50ha.config', line 1.
We can see this error in the ConfigLog:
06/26 09:59:42 INFO: Starting Up
06/26 09:59:42 INFO: Hostname is "node3"
06/26 09:59:42 INFO: Cleaning up temporary configuration files
06/26 09:59:57 INFO: Retrieving configuration version "1340609186881558" from the store
06/26 10:00:00 INFO: Retrieved configuration from the store
06/26 10:00:00 ERROR: Failed to send command "condor_restart" to subsystem "query_server" (retval: 1, stdout: "", stderr: "Can't find address for local query_server
Perhaps you need to query another pool.
")
06/26 10:00:00 ERROR: Failed to send command "condor_restart" to subsystem "schedd" (retval: 1, stdout: "", stderr: "Can't connect to local schedd
")
06/26 10:00:00 INFO: Exiting

No error output was found on node1 in /var/log/wallaby/agent.log.

Expected results:
The configuration from the store should be activated. The schedd and JS/QS services should be managed by HA; I think a restart through the configd should not be required here.
# rm /var/lib/condor/wallaby_node.config
rm: remove regular file `/var/lib/condor/wallaby_node.config'? y
# service condor restart
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
# tail -f /var/log/condor/ConfigLog
07/11 08:10:33 INFO: Cleaning up temporary configuration files
07/11 08:10:43 INFO: Retrieving configuration version "1340713488550492" from the store
07/11 08:10:46 INFO: Retrieved configuration from the store
07/11 08:10:47 ERROR: Failed to send command "condor_restart" to subsystem "query_server" (retval: 1, stdout: "", stderr: "Can't find address for local query_server
Perhaps you need to query another pool.
")
07/11 08:10:47 ERROR: Failed to send command "condor_restart" to subsystem "schedd" (retval: 1, stdout: "", stderr: "Can't connect to local schedd
")
07/11 08:10:47 INFO: Exiting
07/11 08:10:58 INFO: Starting Up
07/11 08:10:58 INFO: Hostname is "rhel-ha-3.hostname"
07/11 08:10:58 INFO: Cleaning up temporary configuration files

# rpm -qa wallaby
wallaby-0.12.5-10.el6.noarch

I still have the same problem. Could you please check whether the patch was included in the errata packages?
That error is expected from the configd. The configd uses the condor_* tools, which cannot send messages to any daemon running under RHHA. If you want to restart daemons running under RHHA, you need to do so through the RHHA tools. Is there a functional problem here?
It looks like a functional problem: if the restart fails, the new configuration is not activated on the node.
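The expected behavior can be sketched as follows: the configd's restart attempt against an RHHA-managed daemon is allowed to fail, but the failure is treated as non-fatal so the configuration retrieved from the store is still activated. This is an illustrative stand-in only, not the actual configd code; `restart_subsystem` here merely simulates the condor_restart failure seen in the ConfigLog.

```shell
#!/bin/sh
# Illustrative sketch: restart_subsystem stands in for the configd invoking
# condor_restart on a subsystem. Under RHHA this fails (e.g. "Can't connect
# to local schedd"), which we simulate with a non-zero exit status.
restart_subsystem() {
    echo "Can't connect to local $1" >&2
    return 1
}

# The failure is logged but treated as non-fatal, so the configuration
# downloaded from the wallaby store is still activated.
for subsys in query_server schedd; do
    if ! restart_subsystem "$subsys" 2>/dev/null; then
        echo "WARN: restart of $subsys failed; daemons under RHHA must be restarted via the cluster tools"
    fi
done
echo "configuration activated"
```

The point of the sketch is only the control flow: the restart error is reported, but it no longer aborts activation, which matches the behavior verified below with wallaby-0.12.5-10.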
# rm /var/lib/condor/wallaby_node.config
rm: remove regular file `/var/lib/condor/wallaby_node.config'? y
[root@rhel-ha-3 ~]# service condor restart
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
# tail -f /var/log/condor/ConfigLog
07/11 14:32:38 INFO: Starting Up
07/11 14:32:38 INFO: Hostname is "rhel-ha-3.hostname"
07/11 14:32:38 INFO: Cleaning up temporary configuration files
07/11 14:32:55 INFO: Retrieving configuration version "1342009897670908" from the store
07/11 14:32:59 INFO: Retrieved configuration from the store
07/11 14:33:00 INFO: Exiting
07/11 14:33:12 INFO: Starting Up
07/11 14:33:12 INFO: Hostname is "rhel-ha-3.hostname"
07/11 14:33:12 INFO: Cleaning up temporary configuration files

# condor_config_val -v CONDOR_HOST
CONDOR_HOST: rhel-ha-1.hostname
Defined in '/var/lib/condor/wallaby_node.config', line 102.
# condor_config_val -v QMF_BROKER_HOST
QMF_BROKER_HOST: rhel-ha-1.hostname
Defined in '/var/lib/condor/wallaby_node.config', line 26.

>>> VERIFIED