Bug 578137 - Scheduler crashes during condor restart
Summary: Scheduler crashes during condor restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On: 606761 610773
Blocks: 596210
Reported: 2010-03-30 11:15 UTC by Martin Kudlej
Modified: 2010-10-20 11:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-20 11:29:43 UTC
Target Upstream Version:
Embargoed:



Description Martin Kudlej 2010-03-30 11:15:57 UTC
Description of problem:
If management plugins are configured and condor is restarted, the scheduler (schedd) crashes.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.(5|8).el5; it also crashes with condor-7.4.3-0.5.el4
condor-qmf-7.4.3-0.(5|8).el5; it also crashes with condor-qmf-7.4.3-0.5.el4

How reproducible:
100%

Steps to Reproduce:
1. set up management plugins
QMF_HOST=localhost
SCHEDD.PLUGINS = $(LIB)/plugins/MgmtScheddPlugin-plugin.so
COLLECTOR.PLUGINS = $(LIB)/plugins/MgmtCollectorPlugin-plugin.so
NEGOTIATOR.PLUGINS = $(LIB)/plugins/MgmtNegotiatorPlugin-plugin.so
MASTER.PLUGINS = $(LIB)/plugins/MgmtMasterPlugin-plugin.so
QMF_BROKER_HOST = localhost

2. service condor restart
3. Check the SchedLog:
03/30 03:17:43 (pid:27161) DaemonCore: Command Socket at <:55518>
03/30 03:17:44 (pid:27161) ClassAdLogPlugin registration succeeded
03/30 03:17:44 (pid:27161) ScheddPlugin registration succeeded
03/30 03:17:44 (pid:27161) Successfully loaded plugin: /usr/lib/condor/plugins/MgmtScheddPlugin-plugin.so
03/30 03:17:44 (pid:27161) History file rotation is enabled.
03/30 03:17:44 (pid:27161)   Maximum history file size is: 20971520 bytes
03/30 03:17:44 (pid:27161)   Number of rotated history files is: 2
03/30 03:17:44 (pid:27161) "/usr/sbin/condor_shadow.std -classad" did not produce any output, ignoring
03/30 03:18:14 (pid:27161) Got SIGQUIT.  Performing fast shutdown.
03/30 03:18:15 (pid:27161) All shadows have been killed, exiting.
03/30 03:18:15 (pid:27161) **** condor_schedd (condor_SCHEDD) pid 27161 EXITING WITH STATUS 0
Stack dump for process 27161 at timestamp 1269933495 (9 frames)
condor_schedd(dprintf_dump_stack+0xda)[0x818e13a]
condor_schedd[0x818e2fa]
/lib/tls/libpthread.so.0[0xd7ec98]
condor_schedd(_Z12unix_sigchldi+0x19)[0x8184a1d]
/lib/tls/libpthread.so.0[0xd7ec98]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x429c4c]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys54_GLOBAL__N_qpid_sys_posix_Thread.cpp_CF4CBC2E_C6AED78B11runRunnableEPv+0x11)[0x41b991]
/lib/tls/libpthread.so.0[0xd785cc]
/lib/tls/libc.so.6(__clone+0x5e)[0xcd0f0e]
  
Actual results:
Schedd crashes during condor restart

Expected results:
Schedd doesn't crash during condor restart

Comment 1 Matthew Farrellee 2010-03-30 11:33:17 UTC
Is this reproducible with both QMF_DELETE_ON_SHUTDOWN = TRUE and FALSE?

Comment 2 Martin Kudlej 2010-04-01 12:43:09 UTC
The crash does not occur with QMF_DELETE_ON_SHUTDOWN set to either TRUE or FALSE.

Comment 3 Martin Kudlej 2010-04-01 12:44:46 UTC
I've tested it 100 times with TRUE and 100 times with FALSE on every combination of x86_64/i386 and RHEL5.5/RHEL4.8.
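
Checking a SchedLog by eye after that many restarts is tedious. A minimal sketch of a helper that counts stack dumps in log text, assuming the "Stack dump for process" marker seen in the logs in this report. The function name countStackDumps is hypothetical and is not part of condor.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper, not condor code: counts log lines that contain the
// "Stack dump for process" marker that precedes each backtrace in SchedLog.
int countStackDumps(const std::string& logText) {
    std::istringstream in(logText);
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        // Substring match so that indented or prefixed lines still count.
        if (line.find("Stack dump for process") != std::string::npos) {
            ++count;
        }
    }
    return count;
}
```

A result of zero after a series of restarts would suggest that the schedd exited cleanly each time.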

Comment 4 Matthew Farrellee 2010-05-15 18:21:08 UTC
Is this still an issue?

Comment 5 Martin Kudlej 2010-05-21 07:24:07 UTC
I've retested it with condor-7.4.3-0.11.el4 and qpid-cpp-server-0.7.935473-1.el4 on RHEL 4, and with condor-7.4.3-0.13.el5 and qpid-cpp-server-0.7.939184-1.el5, and the scheduler no longer crashes.

Comment 7 Pete MacKinnon 2010-06-03 12:58:37 UTC
trusnak still sees a problem with 0.17...

06/03 06:44:38 (pid:14973) Attempting to send update via UDP to collector ibm-x3650-03.ovirt.rhts.eng.bos.redhat.com <10.16.68.46:9618>
06/03 06:44:38 (pid:14973) Canceled/Closed 6 socket(s) at shutdown
06/03 06:44:38 (pid:14973) MgmtScheddPlugin: shutting down...
06/03 06:44:39 (pid:14973) MgmtScheddPlugin: shutting down...
06/03 06:44:39 (pid:14973) All shadows have been killed, exiting.
06/03 06:44:39 (pid:14973) **** condor_schedd (condor_SCHEDD) pid 14973 EXITING WITH STATUS 0
Stack dump for process 14973 at timestamp 1275561879 (8 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0x763420]
[0x763420]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x10bd596]
/usr/lib/libqpidcommon.so.2[0x10b3931]
/lib/libpthread.so.0[0x93f832]
/lib/libc.so.6(clone+0x5e)[0x6b0e0e]


Appears to be a double call to shutdown???

Comment 8 Pete MacKinnon 2010-06-03 13:58:46 UTC
Added a skip bool to shutdown to avoid multiple deletions during shutdown by forked processes.
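
The actual patch is not shown in this report. As a rough illustration of the idea only, here is a minimal sketch of a shutdown routine guarded by a skip flag, so that a second call does nothing. The names PluginSketch and ManagedResources are hypothetical and are not taken from the condor sources. The sketch covers repeated calls within one process; how the real plugin deals with forked children is not described in this thread.

```cpp
#include <cassert>

// Hypothetical stand-in for whatever the plugin deletes at shutdown.
struct ManagedResources {
    int deletions = 0;               // how many times cleanup actually ran
    void release() { ++deletions; }
};

// Hypothetical plugin skeleton; not the real MgmtScheddPlugin.
class PluginSketch {
public:
    explicit PluginSketch(ManagedResources& res) : res_(res) {}

    // Safe to call more than once: only the first call releases resources.
    // Returns true if this call performed the cleanup.
    bool shutdown() {
        if (skip_) {
            return false;            // an earlier call already cleaned up
        }
        skip_ = true;                // set first so a re-entrant call is a no-op
        res_.release();
        return true;
    }

private:
    ManagedResources& res_;
    bool skip_ = false;
};
```

With a guard like this, two consecutive "shutting down..." calls would release the resources only once.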

Comment 9 Pete MacKinnon 2010-06-04 14:06:01 UTC
The bug that wouldn't die... 0.18 has the single shutdown change but something is still wrong:

06/04 09:49:18 (pid:11926) Canceled/Closed 5 socket(s) at shutdown
06/04 09:49:18 (pid:11926) MgmtScheddPlugin: shutting down...
06/04 09:49:19 (pid:11926) All shadows have been killed, exiting.
Stack dump for process 11926 at timestamp 1275659359 (8 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d4a4]
condor_schedd[0x817f204]
[0x774420]
[0x774420]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x1035796]
/usr/lib/libqpidcommon.so.2[0x102bb31]
/lib/libpthread.so.0[0x6ac832]
/lib/libc.so.6(clone+0x5e)[0x601e0e]
06/04 09:49:19 (pid:11926) **** condor_schedd (condor_SCHEDD) pid 11926 EXITING WITH STATUS 0

Comment 10 Pete MacKinnon 2010-06-04 17:01:24 UTC
Sheesh, die already...

Now we won't delete the singleton, and we check that we have an instance before polling callbacks.

If this doesn't work, it's gotta go over to M.
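
The change itself is not attached to this report. Purely as an illustration, a minimal sketch of the two ideas in this comment: shutdown leaves the singleton allocated, and the poll handler checks for an instance before calling into it. AgentSketch, handlePoll, and shutdownWithoutDelete are hypothetical names, not the QMF or condor API.

```cpp
#include <cassert>

// Hypothetical singleton standing in for the management agent.
class AgentSketch {
public:
    // Returns the current instance, or nullptr if none was ever created.
    static AgentSketch* instance() { return instance_; }

    // Creates the instance on first use and returns it.
    static AgentSketch* create() {
        if (!instance_) {
            instance_ = new AgentSketch();
        }
        return instance_;
    }

    void pollCallbacks() { ++polls_; }
    int polls() const { return polls_; }

private:
    AgentSketch() = default;
    static AgentSketch* instance_;
    int polls_ = 0;
};

AgentSketch* AgentSketch::instance_ = nullptr;

// Poll handler: only touches the agent when an instance exists.
// Returns true if callbacks were polled.
bool handlePoll() {
    AgentSketch* agent = AgentSketch::instance();
    if (!agent) {
        return false;
    }
    agent->pollCallbacks();
    return true;
}

// Shutdown deliberately does not delete the singleton, so a handler that
// fires late still finds a valid object; the OS reclaims it at process exit.
void shutdownWithoutDelete() {
    // Intentionally empty in this sketch.
}
```

The trade-off in this sketch is an intentional leak at exit in return for never leaving a dangling pointer behind for a late callback.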

Comment 11 Tomas Rusnak 2010-06-07 11:02:23 UTC
There are two other, similar bugs: 596210 and 534073. In each bug the schedd
crashes in a slightly different way.

I think it would be a good idea to wait for the resolution "Schedd is not
crashing" and then verify them all afterwards.

I retested it in 596210, and the schedd still crashes with the current
packages:

qpid-cpp-server-0.7.946106-2.el5
qpid-java-common-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
condor-qmf-7.4.3-0.18.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
python-qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-2.el5
qpid-tests-0.7.946106-1.el5
python-condorutils-1.4-0.4.el5
qmf-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
python-qpid-0.7.946106-1.el5
qpid-java-client-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
rh-tests-distribution-MRG-Grid-grid_test_segfault_condor_with_qmf_bz534073-1.0-2
condor-debuginfo-7.4.3-0.18.el5
condor-wallaby-tools-2.7-0.5.el5
condor-wallaby-client-2.7-0.5.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
condor-7.4.3-0.18.el5

SchedLog:

06/04 09:49:19 (pid:11926) All shadows have been killed, exiting.
Stack dump for process 11926 at timestamp 1275659359 (8 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d4a4]
condor_schedd[0x817f204]
[0x774420]
[0x774420]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x1035796]
/usr/lib/libqpidcommon.so.2[0x102bb31]
/lib/libpthread.so.0[0x6ac832]
/lib/libc.so.6(clone+0x5e)[0x601e0e]
06/04 09:49:19 (pid:11926) **** condor_schedd (condor_SCHEDD) pid 11926 EXITING WITH STATUS 0

Comment 12 Tomas Rusnak 2010-07-16 14:33:03 UTC
Verified on all combinations of RHEL4/RHEL5 and x86/x86_64 with the latest packages:

$CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL4

$CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

$CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

$CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

99 restarts performed without any issue.

>>> VERIFIED

