625607 – condor_configd (incorrectly) concludes there is no startd running when startd(s) given nonstandard names

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	wallaby-utils
Sub Component:
Version:	1.3
Hardware:	All
OS:	All
Priority:	low
Severity:	medium
Target Milestone:	2.1.1
Target Release:	---
Assignee:	Robert Rati
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-08-19 22:22 UTC by Erik Erlandson
Modified:	2011-12-08 19:59 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-11-22 01:15:41 UTC
Target Upstream Version:
Embargoed:

Description Erik Erlandson 2010-08-19 22:22:20 UTC

Description of problem:

In case of configurations where startds are run under nonstandard daemon-list names (i.e., not STARTD), condor_configd will incorrectly conclude no startd is running, as it searches for "STARTD" in DAEMON_LIST.

For example, if I have a machine configured with:

DAEMON_LIST = MASTER,STARTD_ST1,STARTD_ST2,STARTD_ST3 ...

If I attempt to activate a configuration from the store, the configd log shows output like this:

08/19 17:58:30 INFO: Retrieving configuration version "1282255107129214" from the store
08/19 17:58:42 DEBUG: Retrieved configuration from the store
08/19 17:58:42 DEBUG: Daemons to restart: [u'startd']
08/19 17:58:42 DEBUG: Daemons to reconfig: []
08/19 17:58:42 DEBUG: Not sending "condor_restart" to subsystem "startd" since it is not currently running



Steps to Reproduce:
1. Configure a condor node where a startd is running but is named something nonstandard in DAEMON_LIST (e.g., STARTD_ST1).
2. Modify a parameter that requires a restart (or reconfig?) for that condor node.
3. Activate the configuration (condor_configure_pool --activate) while watching the ConfigLog output for the condor node.
  
Actual results:

condor_configd claims there is no startd running and does not restart the startds (see the log output above).

Expected results:

condor_configd should restart any running startds, regardless of the names they are given in DAEMON_LIST.


Additional info:
See line 428 of condor_configd:
(retval, daemons, err) = run_cmd('condor_config_val -master DAEMON_LIST')

See also the method act_upon_subsys_list().
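
To illustrate the failure mode, here is a minimal sketch, not the actual condor_configd code. It assumes the check is an exact, case-insensitive match of the subsystem name against the entries of DAEMON_LIST. The helper names (parse_daemon_list, is_running_exact, matching_daemons) are hypothetical, and the prefix match shown as an alternative assumes the nonstandard names follow a STARTD_<suffix> convention, as in the example above.

```python
def parse_daemon_list(raw):
    """Split a DAEMON_LIST value on commas and/or whitespace; upper-case names."""
    return [name.upper() for name in raw.replace(',', ' ').split()]


def is_running_exact(subsys, daemon_list):
    """Exact-name check (assumed behavior): misses STARTD_ST1 and friends."""
    return subsys.upper() in daemon_list


def matching_daemons(subsys, daemon_list):
    """Hypothetical alternative: accept SUBSYS itself or any SUBSYS_<suffix>."""
    prefix = subsys.upper()
    return [d for d in daemon_list
            if d == prefix or d.startswith(prefix + '_')]


daemons = parse_daemon_list('MASTER,STARTD_ST1,STARTD_ST2,STARTD_ST3')
print(is_running_exact('startd', daemons))   # False
print(matching_daemons('startd', daemons))   # ['STARTD_ST1', 'STARTD_ST2', 'STARTD_ST3']
```

A prefix match is only a heuristic: it would miss a startd that is given an unrelated name in DAEMON_LIST.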

Comment 1 Erik Erlandson 2010-08-25 16:04:50 UTC

Proposal for fix:

define a config variable:

<subsys>_WALLABY_EQUIV = <equiv1>, <equiv2> ...

for example:

STARTD_WALLABY_EQUIV = STARTD_ST1, STARTD_ST2, ...  STARTD_ST90

Update the configd script to check for these variables; if one is defined, replace <subsys> with <equiv1>, <equiv2> ... as the parameter to condor_restart (or reconfig).
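
A rough sketch of how this expansion could look, assuming the configuration has already been read into a plain dict. The function name expand_subsystems and the dict-based lookup are hypothetical; only the <subsys>_WALLABY_EQUIV variable name comes from the proposal.

```python
def expand_subsystems(subsystems, config):
    """Replace each subsystem with its <SUBSYS>_WALLABY_EQUIV names, if defined.

    `config` is a stand-in dict of configuration values (an assumption here);
    subsystems without an equivalence variable are passed through unchanged.
    """
    expanded = []
    for subsys in subsystems:
        raw = config.get(subsys.upper() + '_WALLABY_EQUIV', '')
        equivs = [name.strip() for name in raw.split(',') if name.strip()]
        expanded.extend(equivs if equivs else [subsys])
    return expanded


config = {'STARTD_WALLABY_EQUIV': 'STARTD_ST1, STARTD_ST2'}
print(expand_subsystems(['startd', 'schedd'], config))
# ['STARTD_ST1', 'STARTD_ST2', 'schedd']
```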

Comment 2 Will Benton 2010-08-26 15:12:48 UTC

Another idea:  for params that are of the form X.Y, assume that X is a subsystem and that X.Y has the same restart/reconfigure behavior as (unqualified) Y. Then derive subsystems for qualified parameters implicitly.
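
One way this idea could be sketched, assuming a lookup table from unqualified parameter names to their restart/reconfig action. The table contents and the names split_param and derive_actions are illustrative assumptions, not taken from the store or from configd.

```python
def split_param(name):
    """Split 'X.Y' into (X, Y); return (None, name) for an unqualified name."""
    subsys, sep, base = name.partition('.')
    if sep and subsys and base:
        return subsys, base
    return None, name


def derive_actions(changed_params, base_actions):
    """Map each implied subsystem X to the actions of its changed X.Y params.

    `base_actions` maps an unqualified parameter Y to 'restart' or 'reconfig'
    (hypothetical table). Unqualified parameters are skipped here, on the
    assumption that the existing subsystem logic already handles them.
    """
    derived = {}
    for name in changed_params:
        subsys, base = split_param(name)
        if subsys is None or base not in base_actions:
            continue
        derived.setdefault(subsys, set()).add(base_actions[base])
    return derived


base_actions = {'NUM_CPUS': 'restart'}  # illustrative entry only
print(derive_actions(['STARTD_ST1.NUM_CPUS', 'LOG'], base_actions))
# {'STARTD_ST1': {'restart'}}
```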

Comment 3 Robert Rati 2010-08-26 15:37:31 UTC

A subsystem in the wallaby store corresponds to a condor daemon to be restarted.  In the case where there are multiple similar daemons, like multiple startds, a subsystem in the store should correspond to a subsystem/daemon condor will be running and monitoring.  To copy a startd subsystem to a new subsystem called startd_1:

condor_configure_store -a -s startd_1

condor_configure_store -e -s startd,startd_1

Then copy all entries from startd to startd_1.
