Bug 621137 - Condor master crash when reconfiguring with not-existing daemons
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3.2
Target Release: ---
Assigned To: Robert Rati
QA Contact: Luigi Toscano
Reported: 2010-08-04 07:15 EDT by Luigi Toscano
Modified: 2011-02-15 07:12 EST
Fixed In Version: condor-7.4.5-0.1
Doc Type: Bug Fix
Doc Text:
The condor_master daemon could crash after being reconfigured. This occurred when condor_master removed a daemon that was not running (but that it was still trying to start) from its list without canceling that daemon's restart timer. When the timer later fired, condor_master crashed. With this fix, condor_master stops the restart timer when an entry is removed from DAEMON_LIST, and the issue no longer occurs.
Last Closed: 2011-02-15 07:12:57 EST
External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2011:0217
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update
Last Updated: 2011-02-15 07:10:15 EST

Description Luigi Toscano 2010-08-04 07:15:00 EDT
Description of problem:
- Add to DAEMON_LIST a daemon that is not installed (e.g. TRIGGERD).
- Execute condor_reconfigure -master. Master will reload the daemons and complain about the missing executable.
- Remove the missing daemon from the list.
- Execute condor_reconfigure -master. 
- Wait (at most) ~2 minutes. See condor_master happily crashing:

Stack dump for process 17201 at timestamp 1280918570 (9 frames)
condor_master(dprintf_dump_stack+0x4e)[0x48bede]
condor_master[0x48dd42]
/lib64/libpthread.so.0[0x3cbe60eb10]
condor_master(_ZN6daemon5StartEb+0x84)[0x4679a4]
condor_master(_ZN12TimerManager7TimeoutEv+0x155)[0x4898f5]
condor_master(_ZN10DaemonCore6DriverEv+0x248)[0x472078]
condor_master(main+0xdc1)[0x485891]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3cbda1d994]
condor_master(__gxx_personality_v0+0x369)[0x461ef9]

Version-Release number of selected component (if applicable):
condor-7.4.4-0.7
Comment 1 Luigi Toscano 2010-08-04 07:33:19 EDT
Other possible stack traces:

-------------------
condor_master(dprintf_dump_stack+0x44)[0x80c9074]
condor_master[0x80caea4]
[0xbd0420]
/lib/libc.so.6[0xa6dd17]
/lib/libc.so.6(__libc_malloc+0x67)[0xa6fe97]
/usr/lib/libstdc++.so.6(_Znwj+0x27)[0x530ab7]
/usr/lib/libstdc++.so.6(_Znaj+0x1d)[0x530bed]
condor_master(_ZN8ExtArrayIN10KillFamily5a_pidEEC1Ei+0x51)[0x8102b71]
condor_master(_ZN10KillFamily12takesnapshotEv+0x37)[0x8101d47]
condor_master(_ZN12TimerManager7TimeoutEv+0x14b)[0x80c6dbb]
condor_master(_ZN10DaemonCore6DriverEv+0x244)[0x80af1d4]
condor_master(main+0xd80)[0x80c2d00]
/lib/libc.so.6(__libc_start_main+0xdc)[0xa19e9c]

---------------------
condor_master(dprintf_dump_stack+0x4e)[0x48bede]
condor_master[0x48dd42]
/lib64/libpthread.so.0[0x3cbe60eb10]
condor_master(condor_basename+0x23)[0x490733]
condor_master(_ZN6daemon9RealStartEv+0xa5)[0x467155]
condor_master(_ZN12TimerManager7TimeoutEv+0x155)[0x4898f5]
condor_master(_ZN10DaemonCore6DriverEv+0x248)[0x472078]
condor_master(main+0xdc1)[0x485891]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3cbda1d994]
condor_master(__gxx_personality_v0+0x369)[0x461ef9]
---------------------

Reproduced on RHEL 5.5 (32-bit and 64-bit) and RHEL 4.8 (at least 32-bit).
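
For readers who want to map the mangled frames in the stack traces above to function names, here is a minimal sketch of a demangler. The helper name `demangle_simple` is hypothetical, and it handles only a small subset of the Itanium C++ ABI mangling (plain nested names, `operator new` / `operator new[]`, and a few builtin parameter types). Templated symbols such as the `ExtArray` constructor are deliberately rejected; a full tool such as `c++filt` is the better choice for those.

```python
# Minimal sketch, not a complete demangler: covers only the simple
# symbols that appear in the traces above.
BUILTIN_TYPES = {"v": "void", "b": "bool", "i": "int", "j": "unsigned int"}
OPERATORS = {"nw": "operator new", "na": "operator new[]"}


def _read_source_name(symbol, pos):
    """Read a length-prefixed identifier (e.g. '6daemon') starting at pos."""
    end = pos
    while end < len(symbol) and symbol[end].isdigit():
        end += 1
    if end == pos:
        raise ValueError("unsupported mangling in %r" % symbol)
    length = int(symbol[pos:end])
    name = symbol[end:end + length]
    if len(name) != length:
        raise ValueError("truncated name in %r" % symbol)
    return name, end + length


def demangle_simple(symbol):
    """Demangle a simple symbol; return non-mangled names unchanged."""
    if not symbol.startswith("_Z"):
        return symbol
    pos = 2
    parts = []
    if symbol[pos:pos + 1] == "N":
        # Nested name: N <len><name> <len><name> ... E
        pos += 1
        while symbol[pos:pos + 1] != "E":
            name, pos = _read_source_name(symbol, pos)
            parts.append(name)
        pos += 1
    elif symbol[pos:pos + 2] in OPERATORS:
        parts.append(OPERATORS[symbol[pos:pos + 2]])
        pos += 2
    else:
        name, pos = _read_source_name(symbol, pos)
        parts.append(name)

    codes = symbol[pos:]
    qualified = "::".join(parts)
    if not codes:
        return qualified
    params = []
    for code in codes:
        if code not in BUILTIN_TYPES:
            raise ValueError("unsupported parameter type in %r" % symbol)
        params.append(BUILTIN_TYPES[code])
    if params == ["void"]:
        params = []  # a lone 'v' encodes an empty parameter list
    return "%s(%s)" % (qualified, ", ".join(params))


for frame in ("_ZN6daemon5StartEb", "_ZN6daemon9RealStartEv",
              "_ZN12TimerManager7TimeoutEv", "_ZN10DaemonCore6DriverEv"):
    print(frame, "->", demangle_simple(frame))
```

Read this way, the 64-bit traces show `TimerManager::Timeout()` invoking `daemon::Start(bool)` or `daemon::RealStart()` directly, i.e. the crash happens inside a timer callback.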
Comment 2 Robert Rati 2010-11-01 17:22:07 EDT
The issue was with the restart timers that condor_master uses to restart daemons that crash or fail to start. When condor_master shut down a daemon that was not currently running (but that it was still trying to start), it did not cancel the restart timer, yet it deleted the daemon from its list of known daemons. When the timer fired some time later, the result was a crash or odd restart messages in MasterLog. The fix is to cancel all of the daemon's timers before deleting all knowledge of the daemon from the master's memory.
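
The pattern described in this comment can be illustrated with a small sketch. This is not the condor_master source: the classes `TimerManager`, `Daemon`, and `Master` and all of their methods are hypothetical stand-ins, and Python can only model the stale callback with an exception, whereas the C++ master dereferenced a deleted object. The `cancel_timers` flag exists only to contrast the old and the fixed behaviour.

```python
class TimerManager:
    """Toy timer service (hypothetical): callbacks fire from run_due()."""

    def __init__(self):
        self._timers = {}  # timer id -> (due time, callback)
        self._next_id = 1

    def register(self, due, callback):
        tid = self._next_id
        self._next_id += 1
        self._timers[tid] = (due, callback)
        return tid

    def cancel(self, tid):
        self._timers.pop(tid, None)

    def run_due(self, now):
        """Fire every timer that is due; return how many fired."""
        fired = 0
        for tid, (due, callback) in sorted(self._timers.items()):
            if due <= now and self._timers.pop(tid, None) is not None:
                fired += 1
                callback()
        return fired


class Daemon:
    def __init__(self, name, timers):
        self.name = name
        self.timers = timers
        self.restart_timer = None
        self.alive = True  # False once the master has forgotten this entry
        self.started = False

    def schedule_restart(self, due):
        self.restart_timer = self.timers.register(due, self.start)

    def start(self):
        self.restart_timer = None
        if not self.alive:
            # Stand-in for the crash: in C++ this is a call on freed memory.
            raise RuntimeError("restart timer fired for removed daemon "
                               + self.name)
        self.started = True

    def cancel_all_timers(self):
        if self.restart_timer is not None:
            self.timers.cancel(self.restart_timer)
            self.restart_timer = None


class Master:
    def __init__(self, timers):
        self.timers = timers
        self.daemons = {}

    def add(self, name, restart_due):
        daemon = Daemon(name, self.timers)
        daemon.schedule_restart(restart_due)
        self.daemons[name] = daemon

    def remove(self, name, cancel_timers=True):
        daemon = self.daemons[name]
        if cancel_timers:
            # The fix: cancel timers before forgetting the daemon.
            daemon.cancel_all_timers()
        del self.daemons[name]
        daemon.alive = False
```

With `cancel_timers=False`, removing a daemon and then calling `run_due()` past the restart time raises the error; with the default, no timer fires for the removed daemon.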
Comment 3 Matthew Farrellee 2010-11-18 14:25:31 EST
FH as UPSTREAM-7.5.5-BZ621137-master-crash-with-bad-daemon, slated for post 7.4.4-0.17
Comment 4 Robert Rati 2010-11-29 13:40:25 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Reconfigure the master daemon after editing DAEMON_LIST to remove a daemon that wasn't running.
C: The condor_master daemon crashes
F: The condor master now stops the restart timer when an entry is removed from the DAEMON_LIST.
R: Reconfiguring condor after editing the DAEMON_LIST to remove a daemon that isn't running will not cause the master to crash.
Comment 6 Luigi Toscano 2011-02-01 12:34:09 EST
condor_master does not crash anymore in the described scenario.
Verified on RHEL 4.9 beta (20110127) / RHEL 5.6, i386/x86_64.
condor-7.4.5-0.7
Comment 7 Eva Kopalova 2011-02-09 11:47:01 EST
    Technical note updated.
    
    Diffed Contents:
@@ -1,4 +1 @@
-C: Reconfigure the master daemon after editing DAEMON_LIST to remove a daemon that wasn't running.
+The condor_master daemon could have crashed on restart. This occurred, when the daemon deleted not-running daemons from its list (while trying to start them), because the daemon did not cancel the restart timer.  When the timer later continued, the deamon crashed. With this fix, condor_master stops the restart timer when an entry is removed from the DAEMON_LIST and the issue no longer occurrs.
-C: The condor_master daemon crashes
-F: The condor master now stops the restart timer when an entry is removed from the DAEMON_LIST.
-R: Reconfiguring condor after editing the DAEMON_LIST to remove a daemon that isn't running will not cause the master to crash.
Comment 8 Eva Kopalova 2011-02-09 11:47:23 EST
    Technical note updated.
    
    Diffed Contents:
@@ -1 +1 @@
-The condor_master daemon could have crashed on restart. This occurred, when the daemon deleted not-running daemons from its list (while trying to start them), because the daemon did not cancel the restart timer.  When the timer later continued, the deamon crashed. With this fix, condor_master stops the restart timer when an entry is removed from the DAEMON_LIST and the issue no longer occurrs.
+The condor_master daemon could have crashed on restart. This occurred, when the daemon deleted not-running daemons from its list (while trying to start them), because the daemon did not cancel the restart timer.  When the timer later continued, the daemon crashed. With this fix, condor_master stops the restart timer when an entry is removed from the DAEMON_LIST and the issue no longer occurs.
Comment 9 errata-xmlrpc 2011-02-15 07:12:57 EST
An advisory has been issued that should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html
