Bug 704597
Summary: | condor_configd hits "CRITICAL: Unable to get node information object" stops updating | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> |
Component: | condor-wallaby-client | Assignee: | Robert Rati <rrati> |
Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 1.3 | CC: | jneedle, mkudlej, trusnak |
Target Milestone: | 2.0.1 | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | condor-wallaby-client-4.1-1 | Doc Type: | Bug Fix |
Doc Text: |
C: A communication interruption when the condor_configd is checking in with the configuration store
C: The thread controlling the periodic updating exits, and the node no longer checks in periodically with the configuration store
F: Handle the exception that can arise and monitor all child threads to ensure they are running. If one is found not running, restart it.
R: The configd will not stop periodically checking in
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2011-09-07 16:41:56 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 723887 |
Description
Matthew Farrellee
2011-05-13 17:30:44 UTC
It doesn't look like debug logging is enabled, so that CRITICAL error could have been a one-time event. That error message is logged when the configd attempts to retrieve the node object from the store. That obviously failed at one point (it looks like during a periodic check-in), but that shouldn't be a hard failure. Was a configuration change attempted through wallaby to show that the node wasn't communicating with wallaby?

I looked at mrg11's log as well and saw a number of CRITICAL errors, but none of them gave any indication of preventing the configd from running. In fact, mrg11 has a string of CRITICAL errors on 4/28-4/30, but the configd continues to run and even pulls a config later. I'm guessing that there have been some broker issues, but I'm not seeing anything in the logs indicating that the configd stopped functioning. I suggest enabling debug logging for the configd on all nodes in the pool.

If the broker is restarted, or communication is otherwise disrupted, while the configd is performing its periodic check-in, the failure may have occurred in a code path that wasn't catching the exception. The exception is now caught and handled appropriately. Additionally, the configd now monitors both the interval timer and, on Windows, the shutdown timer. If either exits, it is restarted.

Fixed on BZ704597-update-timer-stop

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
C: A communication interruption when the condor_configd is checking in with the configuration store
C: The thread controlling the periodic updating exits, and the node no longer checks in periodically with the configuration store
F: Handle the exception that can arise and monitor all child threads to ensure they are running. If one is found not running, restart it.
R: The configd will not stop periodically checking in

Reproduced with:
$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $
condor-wallaby-client-4.0-6.el5
08/09 16:21:07 DEBUG: The system is already running configuration version "1312893808070215"
08/09 16:21:17 DEBUG: Checking version of configuration
08/09 16:21:21 DEBUG: Lost connection to the configuration store
08/09 16:21:21 CRITICAL: Unable to get node information object

Retested over all supported platforms x86,x86_64/RHEL5,RHEL6 with:
condor-wallaby-client-4.1-4
ConfigLog:
08/02 15:00:15 DEBUG: The system is already running configuration version "1312287737712414"
08/02 15:00:17 DEBUG: Lost connection to the configuration store
08/02 15:00:25 DEBUG: Checking version of configuration
08/02 15:00:25 ERROR: Failed to contact configuration store
08/02 15:00:27 DEBUG: Established connection to the configuration store
08/02 15:00:27 DEBUG: Lost connection to the configuration store
08/02 15:00:35 DEBUG: Checking version of configuration
08/02 15:00:35 ERROR: Failed to contact configuration store
08/02 15:00:37 DEBUG: Established connection to the configuration store
08/02 15:00:37 DEBUG: Lost connection to the configuration store
08/02 15:00:45 DEBUG: Checking version of configuration
08/02 15:00:45 ERROR: Failed to contact configuration store
08/02 15:00:47 DEBUG: Established connection to the configuration store
08/02 15:00:55 DEBUG: Checking version of configuration
08/02 15:00:56 DEBUG: Performing a checkin with the store
08/02 15:00:56 DEBUG: Checked in with the store
08/02 15:00:56 DEBUG: The system is already running configuration version "1312287737712414"
No such error message was found after thousands of runs. The wallaby client now handles the periodic update thread properly.
>>> VERIFIED
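For illustration only, here is a minimal Python sketch of the fix as described in this report: catch the check-in exception so the timer thread survives a lost connection, and restart any monitored thread that has exited. This is not the condor-wallaby-client code. `StoreUnavailable`, `run_periodic`, and `Watchdog` are hypothetical names, and the error text is borrowed from the ConfigLog above.

```python
import threading


class StoreUnavailable(Exception):
    """Hypothetical error: the configuration store could not be reached."""


def run_periodic(check_in, interval, stop_event, log):
    """Call check_in() every `interval` seconds until stop_event is set.

    A failed check-in is logged and retried on the next tick instead of
    letting the exception end the thread.
    """
    # Event.wait() returns False on timeout and True once the event is set.
    while not stop_event.wait(interval):
        try:
            check_in()
        except StoreUnavailable as exc:
            log("ERROR: Failed to contact configuration store (%s)" % exc)


class Watchdog:
    """Track named worker threads and restart any that are no longer alive."""

    def __init__(self):
        self._factories = {}  # name -> callable returning an unstarted Thread
        self._threads = {}    # name -> currently running Thread

    def register(self, name, factory):
        """Remember how to build the thread for `name` and start it."""
        self._factories[name] = factory
        self._start(name)

    def _start(self, name):
        thread = self._factories[name]()
        thread.daemon = True
        thread.start()
        self._threads[name] = thread

    def check(self):
        """Restart dead threads; return the names that were restarted."""
        restarted = []
        for name, thread in list(self._threads.items()):
            if not thread.is_alive():
                self._start(name)
                restarted.append(name)
        return restarted
```

In this sketch the main loop of the daemon would call `Watchdog.check()` at some regular interval. Registering a factory rather than a thread object is deliberate: a Python `Thread` cannot be started twice, so a restart needs a fresh instance.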
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-1249.html