A new node checked into wallaby, was not added to the +++SKEL group, and condor_configd pulled down a config containing:

$ condor_config_val -v daemon_list
DAEMON_LIST: >=MASTER, CONFIGD
  Defined in '/var/lib/condor/wallaby_node.config', line 30.

A wallaby activate followed by condor_configd -r cleared up the problem.

The node was then manually added to +++SKEL (wallaby add-nodes-to-group +++SKEL new-node) and activated again, resulting in the expected configuration.

Wallaby had just been updated to wallaby-0.12.1-1.el5.

The invalid configuration is severe as it requires manual intervention because the condor_master is not running.

MasterLog showed,

/var/log/condor/MasterLog:05/23/11 19:20:59 ERROR "Must have the path to >=MASTER defined." at line 1387 in file /builddir/build/BUILD/condor-7.6.3/src/condor_master.V6/masterDaemon.cpp

Also,

$ condor_configure_store -l -f Master | grep DAEMON
DAEMON_LIST = >= MASTER

$ wallaby explain new-node | grep -B1 ^DAEMON_LIST
# DAEMON_LIST is explicitly set in Master, which is installed on the default group
DAEMON_LIST = MASTER
I've:
1) set up wallaby store
2) added Master, NodeAccess, ExecuteNode to default group
3) installed wallaby client on nodes
4) restarted condor on clients

It hasn't started because:

$ condor_config_val DAEMON_LIST
DAEMON_LIST = >=>=STARTD, MASTER, CONFIGD

$ tail /var/log/condor/Master
>=>=STARTD is in the DAEMON_LIST parameter, but there is no executable path for it defined in the config files!
05/28/12 06:30:44 ERROR "Must have the path to >=>=STARTD defined." at line 1388 in file /builddir/build/BUILD/condor-7.6.7/src/condor_master.V6/masterDaemon.cpp
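For anyone who wants to check a node for this symptom quickly, here is a minimal sketch in Python. It only looks for DAEMON_LIST entries that still carry a leading ">=" marker, as in the values quoted above. The function name spurious_daemons is made up for illustration and is not part of Condor or Wallaby.

```python
# Hypothetical helper, not part of Condor or Wallaby.
# Assumption: a leftover ">=" at the start of an entry marks a spurious value,
# as seen in the DAEMON_LIST outputs quoted in this report.
APPEND_MARKER = ">="


def spurious_daemons(daemon_list):
    """Return the DAEMON_LIST entries that still start with the append marker."""
    entries = [entry.strip() for entry in daemon_list.split(",")]
    return [entry for entry in entries if entry.startswith(APPEND_MARKER)]


if __name__ == "__main__":
    # Values copied from the comments in this report.
    for value in (">=>=STARTD, MASTER, CONFIGD", "STARTD, MASTER, CONFIGD"):
        bad = spurious_daemons(value)
        status = "spurious entries: " + ", ".join(bad) if bad else "ok"
        print(value, "->", status)
```

The string passed in would be the value printed by condor_config_val DAEMON_LIST on the affected node.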
"rm /var/lib/condor/wallaby_node.config" has helped.
Uh, sorry. It helped only until a new wallaby configuration was loaded for the clients.
I've reproduced this now.
Could you please write here how to reproduce it?
Martin, the procedure you describe in comment 1 does the trick. The key is having a feature on the default group that appends to DAEMON_LIST (the ones you added will do), then activating the Wallaby configuration before having a new node check in. You can see an example of this in the following test case: http://git.fedorahosted.org/git/?p=grid/wallaby.git;a=blob;f=spec/bz748507_spec.rb;hb=HEAD
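To make the role of the append operator concrete, here is a toy model in Python. It is not Wallaby's implementation, and this report does not say what the actual defect in Wallaby was. The sketch assumes that a value starting with ">=" means "append to the inherited value" and that layers are applied from lowest to highest priority. Both function names are hypothetical, and resolve_buggy shows only one conceivable way a marker could leak into the final value.

```python
# Toy model only; names and layering order are assumptions for illustration.
APPEND = ">="


def resolve(layers):
    """Fold parameter values from lowest to highest priority."""
    result = None
    for value in layers:
        if value.startswith(APPEND):
            extra = value[len(APPEND):].strip()
            # With nothing to append to, the marker is dropped and the
            # remainder becomes the value.
            result = extra if not result else result + ", " + extra
        else:
            # A plain value replaces whatever was inherited.
            result = value
    return result


def resolve_buggy(layers):
    """Hypothetical faulty variant: keeps the marker when there is no base value."""
    result = None
    for value in layers:
        if value.startswith(APPEND) and result:
            result = result + ", " + value[len(APPEND):].strip()
        else:
            # Bug: an append value with no base is copied verbatim, marker included.
            result = value
    return result


if __name__ == "__main__":
    layers = [">= MASTER", ">= STARTD"]
    print(resolve(layers))
    print(resolve_buggy(layers))
```

In this model the faulty variant yields a value that begins with ">=", which condor_master would then treat as part of a daemon name, matching the kind of error shown in the MasterLog excerpts.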
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: In some scenarios in which the default group configuration used string append operators, Wallaby could generate spurious node configurations.
C: When such spurious node configurations were deployed to Condor nodes, the Condor master could fail to start.
F: Wallaby no longer generates such configurations and should work around archived spurious configurations.
R: This problem should no longer present.
Retested on RHEL5/RHEL6 with:
wallaby-0.12.5-1.el6.noarch

# /usr/bin/wallaby load /var/lib/condor-wallaby-base-db/condor-base-db.snapshot
# wallaby add-features-to-group +++DEFAULT Master NodeAccess ExecuteNode
Console Connection Established...
# wallaby add-params-to-feature NodeAccess ALLOW_READ=* ALLOW_WRITE=*
Console Connection Established...
# wallaby add-params-to-group +++DEFAULT CONDOR_HOST=hostname
Console Connection Established...
# wallaby activate
Console Connection Established...

On clients:
# condor_config_val DAEMON_LIST
STARTD, MASTER, CONFIGD

No error found in MasterLog anymore:
# tail /var/log/condor/MasterLog
08/21/12 11:39:47 Using local config sources:
08/21/12 11:39:47 /etc/condor/config.d/00personal_condor.config
08/21/12 11:39:47 /etc/condor/config.d/99configd.config
08/21/12 11:39:47 /etc/condor/config.d/zzz_condor_config.test
08/21/12 11:39:47 /var/lib/condor/wallaby_node.config
08/21/12 11:39:47 DaemonCore: command socket at <IP:35639>
08/21/12 11:39:47 DaemonCore: private command socket at <IP:35639>
08/21/12 11:39:47 Setting maximum accepts per cycle 8.
08/21/12 11:39:47 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19031
08/21/12 11:39:47 Started process "/usr/sbin/condor_configd", pid and pgroup = 19032

>>> VERIFIED
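The MasterLog check above can also be scripted. The sketch below, in Python, searches log text for the "Must have the path to ... defined." error quoted earlier in this report and returns the daemon names it mentions. The function name missing_path_daemons is hypothetical, and the pattern is based only on the two error lines shown in this report.

```python
import re

# Pattern derived from the MasterLog error lines quoted in this report;
# other Condor versions may word the message differently (assumption).
PATH_ERROR = re.compile(r'ERROR "Must have the path to (\S+) defined\."')


def missing_path_daemons(log_text):
    """Return daemon names reported as having no executable path defined."""
    return PATH_ERROR.findall(log_text)


if __name__ == "__main__":
    sample = (
        '05/28/12 06:30:44 ERROR "Must have the path to >=>=STARTD defined." '
        "at line 1388 in file masterDaemon.cpp\n"
    )
    print(missing_path_daemons(sample))
```

An empty list for the tail of /var/log/condor/MasterLog corresponds to the "no error found" result in the retest.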
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-1278.html