748507 – Wallaby provides DAEMON_LIST = >=MASTER -> condor_master failed to startup

Summary: Wallaby provides DAEMON_LIST = >=MASTER -> condor_master failed to startup

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	wallaby
Version:	2.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	2.2
Target Release:	---
Assignee:	Will Benton
QA Contact:	Tomas Rusnak
Blocks:	828434

Reported:	2011-10-24 16:08 UTC by Matthew Farrellee
Modified:	2012-09-25 08:47 UTC
CC List:	4 users
Fixed In Version:	wallaby-0.14.3-1 (backported to wallaby-0.12.5-10)
Doc Type:	Bug Fix
Doc Text:	C: In some scenarios in which the default group configuration used string append operators, Wallaby could generate spurious node configurations. C: When such spurious node configurations were deployed to Condor nodes, the Condor master could fail to start. F: Wallaby no longer generates such configurations and should work around archived spurious configurations. R: This problem should no longer occur.
Last Closed:	2012-09-19 17:41:55 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2012:1278	0	normal	SHIPPED_LIVE	Moderate: Red Hat Enterprise MRG Grid 2.2 security update	2012-09-19 21:40:26 UTC

Description Matthew Farrellee 2011-10-24 16:08:05 UTC

A new node checked in to Wallaby but was not added to the +++SKEL group, and condor_configd pulled down a config containing:

$ condor_config_val -v daemon_list
DAEMON_LIST: >=MASTER, CONFIGD
  Defined in '/var/lib/condor/wallaby_node.config', line 30.

Running wallaby activate followed by condor_configd -r cleared up the problem. The node was then manually added to +++SKEL (wallaby add-nodes-to-group +++SKEL new-node) and the configuration was activated again, which produced the expected configuration.

Wallaby had just been updated to wallaby-0.12.1-1.el5.

The invalid configuration is severe: condor_master fails to start, so recovery requires manual intervention. MasterLog showed:

/var/log/condor/MasterLog:05/23/11 19:20:59 ERROR "Must have the path to >=MASTER defined." at line 1387 in file /builddir/build/BUILD/condor-7.6.3/src/condor_master.V6/masterDaemon.cpp
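condor_master treats each DAEMON_LIST entry as a daemon name and expects a configuration parameter of that name giving the executable path, so a literal entry such as >=MASTER has no path and the master aborts with the error above. The following Python sketch is a hypothetical pre-flight check, not part of Wallaby or Condor: the function name unresolved_entries and the comma/whitespace splitting are assumptions made for illustration.

```python
import re
from typing import List


def unresolved_entries(daemon_list: str) -> List[str]:
    """Return DAEMON_LIST entries that still begin with a '>=' append operator.

    Hypothetical helper: condor_master would look for a path parameter named
    after each such entry and fail to find one.
    """
    # Assumption: entries are separated by commas and/or whitespace.
    entries = [e for e in re.split(r"[,\s]+", daemon_list.strip()) if e]
    return [e for e in entries if e.startswith(">=")]


if __name__ == "__main__":
    # DAEMON_LIST values quoted in this report.
    for value in (">=MASTER, CONFIGD",
                  ">=>=STARTD, MASTER, CONFIGD",
                  "STARTD, MASTER, CONFIGD"):
        bad = unresolved_entries(value)
        print(repr(value), "->", "BROKEN " + str(bad) if bad else "ok")
```

Run against the value reported by condor_config_val, a non-empty result identifies a node whose master will not start.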

Also,

$ condor_configure_store -l -f Master | grep DAEMON
DAEMON_LIST = >= MASTER
$ wallaby explain new-node | grep -B1 ^DAEMON_LIST
# DAEMON_LIST is explicitly set in Master, which is installed on the default group
DAEMON_LIST = MASTER
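The Master feature stores DAEMON_LIST as an append (>= MASTER), which Wallaby is expected to fold into whatever value lower-priority sources supply; with nothing else setting it, wallaby explain reports DAEMON_LIST = MASTER. The Python sketch below models that folding under stated assumptions: ">=" means comma-append to the value accumulated so far, a plain value replaces it, and an append with nothing before it yields only the appended item. The function resolve and its list-of-layers input are hypothetical and are not Wallaby's implementation or API.

```python
from typing import Optional, Sequence

APPEND = ">="  # Wallaby-style append operator, as seen in the Master feature


def resolve(layers: Sequence[str]) -> Optional[str]:
    """Fold parameter values ordered from lowest to highest priority.

    A value beginning with '>=' appends to the accumulated value;
    any other value replaces it outright (assumed semantics).
    """
    result: Optional[str] = None
    for raw in layers:
        value = raw.strip()
        if value.startswith(APPEND):
            addition = value[len(APPEND):].strip()
            # Appending to nothing yields just the new item, never the operator.
            result = addition if not result else f"{result}, {addition}"
        else:
            result = value
    return result


if __name__ == "__main__":
    print(resolve([">= MASTER"]))
    print(resolve([">= STARTD", ">= MASTER"]))
```

Under this model the operator never survives into the output. The broken values in this report (>=MASTER above and >=>=STARTD in comment 1) look like the operator being carried through as literal text instead of being folded.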

Comment 1 Martin Kudlej 2012-05-28 11:37:46 UTC

I've:
1) set up a Wallaby store
2) added Master, NodeAccess, ExecuteNode to the default group
3) installed the Wallaby client on the nodes
4) restarted condor on the clients
Condor hasn't started because:
$ condor_config_val DAEMON_LIST
DAEMON_LIST = >=>=STARTD, MASTER, CONFIGD
$ tail /var/log/condor/Master
>=>=STARTD is in the DAEMON_LIST parameter, but there is no executable path for it defined in the config files!
05/28/12 06:30:44 ERROR "Must have the path to >=>=STARTD defined." at line 1388 in file /builddir/build/BUILD/condor-7.6.7/src/condor_master.V6/masterDaemon.cpp

Comment 2 Martin Kudlej 2012-05-28 11:39:12 UTC

Removing the node configuration (rm /var/lib/condor/wallaby_node.config) helped.

Comment 3 Martin Kudlej 2012-05-28 11:40:49 UTC

Sorry, correction: that helped only until the clients loaded a new Wallaby configuration.

Comment 4 Will Benton 2012-05-30 18:30:04 UTC

I've reproduced this now.

Comment 5 Martin Kudlej 2012-06-29 11:22:27 UTC

Could you please write here how to reproduce it?

Comment 6 Will Benton 2012-06-29 13:26:52 UTC

Martin, the procedure you describe in comment 1 does the trick.  The key is having a feature on the default group that appends to DAEMON_LIST (the ones you added will do), then activating the Wallaby configuration before having a new node check in.  You can see an example of this in the following test case:

http://git.fedorahosted.org/git/?p=grid/wallaby.git;a=blob;f=spec/bz748507_spec.rb;hb=HEAD

Comment 9 Will Benton 2012-08-13 20:15:52 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C:  In some scenarios in which the default group configuration used string append operators, Wallaby could generate spurious node configurations.
C:  When such spurious node configurations were deployed to Condor nodes, the Condor master could fail to start.
F:  Wallaby no longer generates such configurations and should work around archived spurious configurations.
R:  This problem should no longer occur.

Comment 10 Tomas Rusnak 2012-08-21 15:45:21 UTC

Retested on RHEL5/RHEL6 with:
wallaby-0.12.5-1.el6.noarch

# /usr/bin/wallaby load /var/lib/condor-wallaby-base-db/condor-base-db.snapshot
# wallaby add-features-to-group +++DEFAULT Master NodeAccess ExecuteNode
Console Connection Established...
# wallaby add-params-to-feature NodeAccess ALLOW_READ=* ALLOW_WRITE=*
Console Connection Established...
# wallaby add-params-to-group +++DEFAULT CONDOR_HOST=hostname
Console Connection Established...
# wallaby activate
Console Connection Established...

On clients:
# condor_config_val DAEMON_LIST
STARTD, MASTER, CONFIGD

MasterLog no longer shows the error:

# tail  /var/log/condor/MasterLog 
08/21/12 11:39:47 Using local config sources: 
08/21/12 11:39:47    /etc/condor/config.d/00personal_condor.config
08/21/12 11:39:47    /etc/condor/config.d/99configd.config
08/21/12 11:39:47    /etc/condor/config.d/zzz_condor_config.test
08/21/12 11:39:47    /var/lib/condor/wallaby_node.config
08/21/12 11:39:47 DaemonCore: command socket at <IP:35639>
08/21/12 11:39:47 DaemonCore: private command socket at <IP:35639>
08/21/12 11:39:47 Setting maximum accepts per cycle 8.
08/21/12 11:39:47 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19031
08/21/12 11:39:47 Started process "/usr/sbin/condor_configd", pid and pgroup = 19032

>>> VERIFIED

Comment 12 errata-xmlrpc 2012-09-19 17:41:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1278.html
