condor-wallaby-client-4.0-5.el5

---
$ wallaby inventory
Console Connection Established...
node name    is provisioned?    last checkin
---------    ---------------    ------------
...
mrg26        provisioned        Tue May 10 04:54:42 -0400 2011
...

[root@mrg26 ~]# tail /var/log/condor/ConfigLog
05/09 11:44:48 INFO: Starting Up
05/09 11:44:48 INFO: Hostname is "mrg26"
05/09 11:44:48 INFO: Cleaning up temporary configuration files
05/09 11:48:29 INFO: Retrieving configuration version "1304956105456471" from the store
05/09 11:48:31 INFO: Retrieved configuration from the store
05/09 11:48:32 INFO: Exiting
05/09 11:48:33 INFO: Starting Up
05/09 11:48:33 INFO: Hostname is "mrg26"
05/09 11:48:33 INFO: Cleaning up temporary configuration files
05/10 01:53:02 CRITICAL: Unable to get node information object

[root@mrg26 ~]# tail /var/log/condor/MasterLog
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11232
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11241
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11253
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11262
05/09/11 11:48:33 Started process "/usr/sbin/condor_configd", pid and pgroup = 11264
05/09/11 12:48:33 Preen pid is 1570
05/10/11 12:48:33 Preen pid is 20805
05/11/11 12:48:33 Preen pid is 25009
05/12/11 12:48:33 Preen pid is 29218
05/13/11 12:48:33 Preen pid is 974
---

It is unknown what caused the "Unable to get node information object" but a restart (condor_off/on -subsys configd) appears to get it back to a proper state. Possibly the configd should exit on the critical error and allow the master to restart it fresh.
It doesn't look like debug logging is enabled, so that CRITICAL error could have been a one-time event. That error message is logged when the configd attempts to retrieve the node object from the store. That obviously failed at one point (it looks like during a periodic check-in), but that shouldn't be a hard failure. Was a configuration change attempted through wallaby that showed the node wasn't communicating with wallaby?

I looked at mrg11's log as well and saw a number of CRITICAL errors, but none of them gave any indication of preventing the configd from running. In fact, mrg11 has a string of CRITICAL errors on 4/28-4/30, but the configd continues to run and even pulls a config later. I'm guessing that there have been some broker issues, but I'm not seeing anything in the logs indicating that the configd stopped functioning.

I suggest enabling debug logging for the configd on all nodes in the pool.
If the broker is restarted, or communication is otherwise disrupted, while the configd is performing its periodic check-in, the failure could occur in a code path that wasn't catching the exception. The exception is now caught and handled appropriately. Additionally, the configd now monitors both the interval timer and, on Windows, the shutdown timer. If either exits, it is restarted.

Fixed on BZ704597-update-timer-stop
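The two parts of the fix described above (catching the check-in exception, and restarting a timer thread that has exited) can be illustrated with a minimal Python sketch. This is not the configd source: `StoreUnavailable`, `periodic_checkin`, and `Watchdog` are hypothetical names invented for this example, and the real exception type and thread layout may differ.

```python
import threading


class StoreUnavailable(Exception):
    """Hypothetical error: the configuration store could not be reached."""


def periodic_checkin(check_in, interval, stop_event, log):
    """Call check_in() every `interval` seconds until stop_event is set.

    A failed check-in is logged and retried on the next tick, so the
    exception never ends the thread.
    """
    while not stop_event.wait(interval):
        try:
            check_in()
        except StoreUnavailable as exc:
            log.append("ERROR: Failed to contact configuration store: %s" % exc)


class Watchdog:
    """Restart registered worker threads that are found not running."""

    def __init__(self):
        self._factories = {}  # name -> callable returning a new Thread
        self._threads = {}    # name -> most recently started Thread
        self.restarts = 0

    def register(self, name, factory):
        """Remember how to build the worker and start it once."""
        self._factories[name] = factory
        self._start(name)

    def _start(self, name):
        thread = self._factories[name]()
        thread.daemon = True
        thread.start()
        self._threads[name] = thread

    def poll(self):
        """Check each worker once and restart any that has exited."""
        for name, thread in list(self._threads.items()):
            if not thread.is_alive():
                self.restarts += 1
                self._start(name)
```

In this sketch a supervising loop would call `Watchdog.poll()` at a fixed interval. Passing a factory rather than a thread object matters because a Python `Thread` cannot be started twice, so a restart has to build a fresh one.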
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: A communication interruption when the condor_configd is checking in with the configuration store
C: The thread controlling the periodic updating exits, and the node no longer checks in periodically with the configuration store
F: Handle the exception that can arise and monitor all child threads to ensure they are running. If one is found not running, restart it.
R: The configd will not stop periodically checking in
Reproduced with:
$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $
condor-wallaby-client-4.0-6.el5

08/09 16:21:07 DEBUG: The system is already running configuration version "1312893808070215"
08/09 16:21:17 DEBUG: Checking version of configuration
08/09 16:21:21 DEBUG: Lost connection to the configuration store
08/09 16:21:21 CRITICAL: Unable to get node information object
Retested over all supported platforms x86,x86_64/RHEL5,RHEL6 with:
condor-wallaby-client-4.1-4

ConfigLog:
08/02 15:00:15 DEBUG: The system is already running configuration version "1312287737712414"
08/02 15:00:17 DEBUG: Lost connection to the configuration store
08/02 15:00:25 DEBUG: Checking version of configuration
08/02 15:00:25 ERROR: Failed to contact configuration store
08/02 15:00:27 DEBUG: Established connection to the configuration store
08/02 15:00:27 DEBUG: Lost connection to the configuration store
08/02 15:00:35 DEBUG: Checking version of configuration
08/02 15:00:35 ERROR: Failed to contact configuration store
08/02 15:00:37 DEBUG: Established connection to the configuration store
08/02 15:00:37 DEBUG: Lost connection to the configuration store
08/02 15:00:45 DEBUG: Checking version of configuration
08/02 15:00:45 ERROR: Failed to contact configuration store
08/02 15:00:47 DEBUG: Established connection to the configuration store
08/02 15:00:55 DEBUG: Checking version of configuration
08/02 15:00:56 DEBUG: Performing a checkin with the store
08/02 15:00:56 DEBUG: Checked in with the store
08/02 15:00:56 DEBUG: The system is already running configuration version "1312287737712414"

The "Unable to get node information object" error was not seen after thousands of runs. The Wallaby client now handles the periodic update thread properly.

>>> VERIFIED
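For repeated runs like the ones above, the check for the CRITICAL message can be scripted. The sketch below assumes the ConfigLog line layout quoted in this report (`MM/DD HH:MM:SS LEVEL: message`); the function names are hypothetical and the script is not part of condor-wallaby-client.

```python
import re
from collections import Counter

# Matches ConfigLog lines as quoted in this report, for example:
# 08/02 15:00:25 ERROR: Failed to contact configuration store
LOG_LINE = re.compile(r"^(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) ([A-Z]+): (.*)$")


def parse_config_log(lines):
    """Yield (date, time, level, message) for each line that matches."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            yield match.groups()


def summarize(lines, needle="Unable to get node information object"):
    """Return (entry count per level, entries whose message contains needle)."""
    levels = Counter()
    hits = []
    for date, clock, level, message in parse_config_log(lines):
        levels[level] += 1
        if needle in message:
            hits.append((date, clock, level, message))
    return levels, hits
```

Calling `summarize(open("/var/log/condor/ConfigLog"))` on a node would give the per-level counts and every occurrence of the message, so an empty `hits` list corresponds to the "not seen" result reported here.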
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html