704597 – condor_configd hits "CRITICAL: Unable to get node information object" stops updating

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor-wallaby-client
Sub Component:
Version:	1.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	2.0.1
Target Release:	---
Assignee:	Robert Rati
QA Contact:	Tomas Rusnak
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	723887

Reported:	2011-05-13 17:30 UTC by Matthew Farrellee
Modified:	2011-09-07 16:41 UTC
CC List:	3 users
Fixed In Version:	condor-wallaby-client-4.1-1
Doc Type:	Bug Fix
Doc Text:	C: A communication interruption when the condor_configd is checking in with the configuration store C: The thread controlling the periodic updating exits, and the node no longer checks in periodically with the configuration store F: Handle the exception that can arise and monitor all child threads to ensure they are running. If one is found not running, restart it. R: The configd will not stop periodically checking in
Clone Of:
Environment:
Last Closed:	2011-09-07 16:41:56 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1249	0	normal	SHIPPED_LIVE	Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update	2011-09-07 16:40:45 UTC

Description Matthew Farrellee 2011-05-13 17:30:44 UTC

condor-wallaby-client-4.0-5.el5

---

$ wallaby inventory
Console Connection Established...
                node name is provisioned?                             last checkin
                --------- ---------------                             ------------
...
 mrg26                        provisioned           Tue May 10 04:54:42 -0400 2011
...

[root@mrg26 ~]# tail /var/log/condor/ConfigLog
05/09 11:44:48 INFO: Starting Up
05/09 11:44:48 INFO: Hostname is "mrg26"
05/09 11:44:48 INFO: Cleaning up temporary configuration files
05/09 11:48:29 INFO: Retrieving configuration version "1304956105456471" from the store
05/09 11:48:31 INFO: Retrieved configuration from the store
05/09 11:48:32 INFO: Exiting
05/09 11:48:33 INFO: Starting Up
05/09 11:48:33 INFO: Hostname is "mrg26"
05/09 11:48:33 INFO: Cleaning up temporary configuration files
05/10 01:53:02 CRITICAL: Unable to get node information object

[root@mrg26 ~]# tail /var/log/condor/MasterLog
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11232
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11241
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11253
05/09/11 11:48:33 Started process "/usr/sbin/condor_startd", pid and pgroup = 11262
05/09/11 11:48:33 Started process "/usr/sbin/condor_configd", pid and pgroup = 11264
05/09/11 12:48:33 Preen pid is 1570
05/10/11 12:48:33 Preen pid is 20805
05/11/11 12:48:33 Preen pid is 25009
05/12/11 12:48:33 Preen pid is 29218
05/13/11 12:48:33 Preen pid is 974

---

It is unknown what caused the "Unable to get node information object" error, but a restart (condor_off/on -subsys configd) appears to return the configd to a proper state. Possibly the configd should exit on the critical error and allow the master to restart it fresh.

Comment 2 Robert Rati 2011-05-16 15:50:36 UTC

It doesn't look like debug logging is enabled, so that CRITICAL error could have been a one-time event. That error message is logged when the configd attempts to retrieve the node object from the store. That obviously failed at one point (it looks like during a periodic check-in), but it shouldn't be a hard failure. Was any configuration change attempted through wallaby to show that the node wasn't communicating with wallaby?

I looked at mrg11's log as well and saw a number of critical errors, but none of them gave any indication of preventing the configd from running. In fact, mrg11 has a string of CRITICAL errors on 4/28-4/30, but the configd continues to run and even pulls a config later.

I'm guessing that there have been some broker issues, but I'm not seeing anything in the logs indicating that the configd stopped functioning. I suggest enabling debug logging for the configd on all nodes in the pool.

Comment 3 Robert Rati 2011-06-23 16:51:01 UTC

If the broker is restarted, or communication is otherwise disrupted, while the configd is performing its periodic check-in, the resulting exception could be raised in an area of the code that wasn't catching it. The exception is now caught and handled appropriately.

Additionally, the configd now monitors both the interval timer and, on Windows, the shutdown timer. If either exits, it is restarted.

Fixed on BZ704597-update-timer-stop
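
The two parts of the fix described in this comment (catching the communication exception, and restarting a timer thread that has exited) can be illustrated with a minimal sketch. This is not the configd's actual code: `PeriodicWorker` and `monitor_once` are hypothetical names, and `ConnectionError` only stands in for whatever exception the configuration store client raises when the broker goes away.

```python
import threading


class PeriodicWorker:
    """Run `task` every `interval` seconds on a restartable thread (hypothetical)."""

    def __init__(self, name, interval, task):
        self.name = name
        self.interval = interval
        self.task = task
        self.stop_event = threading.Event()
        self.thread = None
        self.starts = 0

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self.stop_event.wait(self.interval):
            try:
                self.task()
            except ConnectionError as exc:
                # Anticipated failure: report it and retry on the next interval.
                print("ERROR: %s: %s" % (self.name, exc))

    def start(self):
        self.thread = threading.Thread(target=self._run, name=self.name, daemon=True)
        self.thread.start()
        self.starts += 1

    def is_alive(self):
        return self.thread is not None and self.thread.is_alive()

    def stop(self):
        self.stop_event.set()
        if self.thread is not None:
            self.thread.join()


def monitor_once(workers):
    """Start every worker that is not running and not stopped; return their names."""
    restarted = []
    for worker in workers:
        if not worker.stop_event.is_set() and not worker.is_alive():
            worker.start()
            restarted.append(worker.name)
    return restarted
```

A main loop would call `monitor_once` periodically. Any exception other than the anticipated one still ends the worker thread, which is the case the monitor exists to recover from.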

Comment 4 Robert Rati 2011-06-23 20:55:43 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: A communication interruption when the condor_configd is checking in with the configuration store
C: The thread controlling the periodic updating exits, and the node no longer checks in periodically with the configuration store
F: Handle the exception that can arise and monitor all child threads to ensure they are running.  If one is found not running, restart it.
R: The configd will not stop periodically checking in

Comment 6 Tomas Rusnak 2011-08-09 14:54:09 UTC

Reproduced with:

$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

condor-wallaby-client-4.0-6.el5

08/09 16:21:07 DEBUG: The system is already running configuration version "1312893808070215"
08/09 16:21:17 DEBUG: Checking version of configuration
08/09 16:21:21 DEBUG: Lost connection to the configuration store
08/09 16:21:21 CRITICAL: Unable to get node information object

Comment 7 Tomas Rusnak 2011-08-10 09:39:29 UTC

Retested over all supported platforms x86,x86_64/RHEL5,RHEL6 with:

condor-wallaby-client-4.1-4

ConfigLog:
08/02 15:00:15 DEBUG: The system is already running configuration version "1312287737712414"
08/02 15:00:17 DEBUG: Lost connection to the configuration store
08/02 15:00:25 DEBUG: Checking version of configuration
08/02 15:00:25 ERROR: Failed to contact configuration store
08/02 15:00:27 DEBUG: Established connection to the configuration store
08/02 15:00:27 DEBUG: Lost connection to the configuration store
08/02 15:00:35 DEBUG: Checking version of configuration
08/02 15:00:35 ERROR: Failed to contact configuration store
08/02 15:00:37 DEBUG: Established connection to the configuration store
08/02 15:00:37 DEBUG: Lost connection to the configuration store
08/02 15:00:45 DEBUG: Checking version of configuration
08/02 15:00:45 ERROR: Failed to contact configuration store
08/02 15:00:47 DEBUG: Established connection to the configuration store
08/02 15:00:55 DEBUG: Checking version of configuration
08/02 15:00:56 DEBUG: Performing a checkin with the store
08/02 15:00:56 DEBUG: Checked in with the store
08/02 15:00:56 DEBUG: The system is already running configuration version "1312287737712414"

No such error message was found after thousands of runs. The Wallaby client now handles the periodic update thread properly.

>>> VERIFIED
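
A check like the one above (confirming that no CRITICAL message appears in the ConfigLog) could be scripted roughly as follows. This is only a sketch: the `summarize` helper and its regular expression are hypothetical, and the line layout is assumed from the ConfigLog excerpts quoted in this report.

```python
import re

# Assumed layout, taken from the excerpts in this report: "MM/DD HH:MM:SS LEVEL: message"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+): (.*)$")


def summarize(lines):
    """Count log lines per level and collect (timestamp, message) for CRITICAL lines."""
    counts = {}
    critical = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything in another format
        stamp, level, message = match.groups()
        counts[level] = counts.get(level, 0) + 1
        if level == "CRITICAL":
            critical.append((stamp, message))
    return counts, critical


# Sample lines copied from the reproduction in comment 6.
sample = [
    '08/09 16:21:07 DEBUG: The system is already running configuration version "1312893808070215"',
    "08/09 16:21:17 DEBUG: Checking version of configuration",
    "08/09 16:21:21 DEBUG: Lost connection to the configuration store",
    "08/09 16:21:21 CRITICAL: Unable to get node information object",
]
counts, critical = summarize(sample)
print(counts)
print(critical)
```

To scan a live node, the same function can be fed the lines of /var/log/condor/ConfigLog; an empty `critical` list corresponds to the "no such error message" result reported here.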

Comment 8 errata-xmlrpc 2011-09-07 16:41:56 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html
