Description of problem:
If management plugins are configured and condor restarts, the scheduler (condor_schedd) crashes.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.(5|8).el5 (it also crashes with condor-7.4.3-0.5.el4)
condor-qmf-7.4.3-0.(5|8).el5 (it also crashes with condor-qmf-7.4.3-0.5.el4)

How reproducible:
100%

Steps to Reproduce:
1. Set up management plugins:
    QMF_HOST=localhost
    SCHEDD.PLUGINS = $(LIB)/plugins/MgmtScheddPlugin-plugin.so
    COLLECTOR.PLUGINS = $(LIB)/plugins/MgmtCollectorPlugin-plugin.so
    NEGOTIATOR.PLUGINS = $(LIB)/plugins/MgmtNegotiatorPlugin-plugin.so
    MASTER.PLUGINS = $(LIB)/plugins/MgmtMasterPlugin-plugin.so
    QMF_BROKER_HOST = localhost
2. service condor restart
3. Check the schedd log:
    03/30 03:17:43 (pid:27161) DaemonCore: Command Socket at <:55518>
    03/30 03:17:44 (pid:27161) ClassAdLogPlugin registration succeeded
    03/30 03:17:44 (pid:27161) ScheddPlugin registration succeeded
    03/30 03:17:44 (pid:27161) Successfully loaded plugin: /usr/lib/condor/plugins/MgmtScheddPlugin-plugin.so
    03/30 03:17:44 (pid:27161) History file rotation is enabled.
    03/30 03:17:44 (pid:27161) Maximum history file size is: 20971520 bytes
    03/30 03:17:44 (pid:27161) Number of rotated history files is: 2
    03/30 03:17:44 (pid:27161) "/usr/sbin/condor_shadow.std -classad" did not produce any output, ignoring
    03/30 03:18:14 (pid:27161) Got SIGQUIT. Performing fast shutdown.
    03/30 03:18:15 (pid:27161) All shadows have been killed, exiting.
    03/30 03:18:15 (pid:27161) **** condor_schedd (condor_SCHEDD) pid 27161 EXITING WITH STATUS 0
    Stack dump for process 27161 at timestamp 1269933495 (9 frames)
    condor_schedd(dprintf_dump_stack+0xda)[0x818e13a]
    condor_schedd[0x818e2fa]
    /lib/tls/libpthread.so.0[0xd7ec98]
    condor_schedd(_Z12unix_sigchldi+0x19)[0x8184a1d]
    /lib/tls/libpthread.so.0[0xd7ec98]
    /usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x429c4c]
    /usr/lib/libqpidcommon.so.2(_ZN4qpid3sys54_GLOBAL__N_qpid_sys_posix_Thread.cpp_CF4CBC2E_C6AED78B11runRunnableEPv+0x11)[0x41b991]
    /lib/tls/libpthread.so.0[0xd785cc]
    /lib/tls/libc.so.6(__clone+0x5e)[0xcd0f0e]

Actual results:
Schedd crashes during condor restart.

Expected results:
Schedd doesn't crash during condor restart.
Is this reproducible with both QMF_DELETE_ON_SHUTDOWN = TRUE and FALSE?
The crash does not occur with QMF_DELETE_ON_SHUTDOWN set to either TRUE or FALSE.
I've tested it 100 times with TRUE and 100 times with FALSE on each combination of x86_64/i386 and RHEL5.5/RHEL4.8.
Is this still an issue?
I've retested it with condor-7.4.3-0.11.el4 and qpid-cpp-server-0.7.935473-1.el4 on RHEL 4, and with condor-7.4.3-0.13.el5 and qpid-cpp-server-0.7.939184-1.el5, and it works fine; the scheduler does not crash.
trusnak still sees a problem with 0.17...

    06/03 06:44:38 (pid:14973) Attempting to send update via UDP to collector ibm-x3650-03.ovirt.rhts.eng.bos.redhat.com <10.16.68.46:9618>
    06/03 06:44:38 (pid:14973) Canceled/Closed 6 socket(s) at shutdown
    06/03 06:44:38 (pid:14973) MgmtScheddPlugin: shutting down...
    06/03 06:44:39 (pid:14973) MgmtScheddPlugin: shutting down...
    06/03 06:44:39 (pid:14973) All shadows have been killed, exiting.
    06/03 06:44:39 (pid:14973) **** condor_schedd (condor_SCHEDD) pid 14973 EXITING WITH STATUS 0
    Stack dump for process 14973 at timestamp 1275561879 (8 frames)
    condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
    condor_schedd[0x817f0a4]
    [0x763420]
    [0x763420]
    /usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x10bd596]
    /usr/lib/libqpidcommon.so.2[0x10b3931]
    /lib/libpthread.so.0[0x93f832]
    /lib/libc.so.6(clone+0x5e)[0x6b0e0e]

Appears to be a double call to shutdown???
Added a skip bool to shutdown so that forked processes do not cause multiple deletions during shutdown.
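For readers following along, the idea of a skip flag can be sketched roughly as below. This is only an illustration under assumptions: PluginSketch, Agent, skip_ and teardowns() are hypothetical names, not the actual MgmtScheddPlugin code, and the sketch covers only a repeated call within one process.

```cpp
#include <cassert>

// Hypothetical stand-in for a management plugin; NOT the real MgmtScheddPlugin.
class PluginSketch {
public:
    PluginSketch() = default;
    PluginSketch(const PluginSketch&) = delete;
    PluginSketch& operator=(const PluginSketch&) = delete;
    ~PluginSketch() { delete agent_; }  // deleting nullptr is a no-op

    // Returns true only on the call that actually performed the teardown.
    bool shutdown() {
        if (skip_) {
            return false;   // an earlier call already cleaned up
        }
        skip_ = true;       // set first so a repeated call bails out
        delete agent_;      // the deletion now happens at most once
        agent_ = nullptr;
        ++teardowns_;
        return true;
    }

    // Number of times the teardown body really ran (for illustration only).
    int teardowns() const { return teardowns_; }

private:
    struct Agent { int dummy = 0; };  // placeholder for the managed object
    Agent* agent_ = new Agent();
    bool skip_ = false;
    int teardowns_ = 0;
};
```

With this shape, calling shutdown() twice (as the two "MgmtScheddPlugin: shutting down..." log lines suggest was happening) runs the deletion only once; the second call returns false and does nothing.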
The bug that wouldn't die... 0.18 has the single shutdown change but something is still wrong:

    06/04 09:49:18 (pid:11926) Canceled/Closed 5 socket(s) at shutdown
    06/04 09:49:18 (pid:11926) MgmtScheddPlugin: shutting down...
    06/04 09:49:19 (pid:11926) All shadows have been killed, exiting.
    Stack dump for process 11926 at timestamp 1275659359 (8 frames)
    condor_schedd(dprintf_dump_stack+0x44)[0x817d4a4]
    condor_schedd[0x817f204]
    [0x774420]
    [0x774420]
    /usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x1035796]
    /usr/lib/libqpidcommon.so.2[0x102bb31]
    /lib/libpthread.so.0[0x6ac832]
    /lib/libc.so.6(clone+0x5e)[0x601e0e]
    06/04 09:49:19 (pid:11926) **** condor_schedd (condor_SCHEDD) pid 11926 EXITING WITH STATUS 0
Sheesh, die already... Now we don't delete the singleton, and we check that we have an instance before polling callbacks. If this doesn't work, it's gotta go over to M.
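A rough sketch of that approach, for illustration only. AgentSketch, pollIfAvailable() and the active_ flag are hypothetical names and an assumed design, not the real plugin or QMF agent API, and the sketch is single-threaded (no locking or atomics).

```cpp
#include <cassert>

// Hypothetical singleton standing in for the management agent.
class AgentSketch {
public:
    static AgentSketch* create() {
        if (instance_ == nullptr) {
            instance_ = new AgentSketch();
        }
        return instance_;
    }
    static AgentSketch* instance() { return instance_; }

    // Shutdown only marks the agent inactive. The object is deliberately
    // NOT deleted, so code still holding the pointer never touches freed
    // memory; the process is about to exit anyway.
    void shutdown() { active_ = false; }
    bool active() const { return active_; }

    // Placeholder for dispatching pending callbacks.
    void pollCallbacks() { ++polled_; }
    int polled() const { return polled_; }

private:
    AgentSketch() = default;
    static AgentSketch* instance_;
    bool active_ = true;
    int polled_ = 0;
};

AgentSketch* AgentSketch::instance_ = nullptr;

// Hypothetical handler: poll only if an instance exists and is still active.
bool pollIfAvailable() {
    AgentSketch* agent = AgentSketch::instance();
    if (agent == nullptr || !agent->active()) {
        return false;  // nothing to poll; skip instead of dereferencing
    }
    agent->pollCallbacks();
    return true;
}
```

The trade-off in this sketch is a small intentional leak at exit in return for never dereferencing a deleted object during shutdown.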
There are two different but similar bugs: 596210 and 534073. In each bug the schedd crashes in a slightly different way. I think it is a good idea to wait for the resolution "Schedd is not crashing" and then verify both afterwards. I retested it in 596210, and the result is that the schedd is still crashing on the current packages:

    qpid-cpp-server-0.7.946106-2.el5
    qpid-java-common-0.7.946106-3.el5
    qpid-cpp-server-xml-0.7.946106-2.el5
    qpid-cpp-client-devel-docs-0.7.946106-2.el5
    condor-qmf-7.4.3-0.18.el5
    qpid-cpp-server-store-0.7.946106-2.el5
    qpid-cpp-client-ssl-0.7.946106-2.el5
    python-qmf-0.7.946106-3.el5
    qmf-devel-0.7.946106-2.el5
    qpid-tests-0.7.946106-1.el5
    python-condorutils-1.4-0.4.el5
    qmf-0.7.946106-2.el5
    qpid-cpp-client-devel-0.7.946106-2.el5
    python-qpid-0.7.946106-1.el5
    qpid-java-client-0.7.946106-3.el5
    qpid-tools-0.7.946106-4.el5
    qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
    rh-tests-distribution-MRG-Grid-grid_test_segfault_condor_with_qmf_bz534073-1.0-2
    condor-debuginfo-7.4.3-0.18.el5
    condor-wallaby-tools-2.7-0.5.el5
    condor-wallaby-client-2.7-0.5.el5
    qpid-cpp-client-0.7.946106-2.el5
    qpid-cpp-server-ssl-0.7.946106-2.el5
    qpid-cpp-server-cluster-0.7.946106-2.el5
    qpid-cpp-server-devel-0.7.946106-2.el5
    condor-7.4.3-0.18.el5

SchedLog:

    06/04 09:49:19 (pid:11926) All shadows have been killed, exiting.
    Stack dump for process 11926 at timestamp 1275659359 (8 frames)
    condor_schedd(dprintf_dump_stack+0x44)[0x817d4a4]
    condor_schedd[0x817f204]
    [0x774420]
    [0x774420]
    /usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x1035796]
    /usr/lib/libqpidcommon.so.2[0x102bb31]
    /lib/libpthread.so.0[0x6ac832]
    /lib/libc.so.6(clone+0x5e)[0x601e0e]
    06/04 09:49:19 (pid:11926) **** condor_schedd (condor_SCHEDD) pid 11926 EXITING WITH STATUS 0
Verified over all combinations of RHEL4/RHEL5 and x86, x86_64 with the latest packages:

    $CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
    $CondorPlatform: X86_64-LINUX_RHEL4 $

    $CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
    $CondorPlatform: I386-LINUX_RHEL4 $

    $CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
    $CondorPlatform: X86_64-LINUX_RHEL5 $

    $CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
    $CondorPlatform: I386-LINUX_RHEL5 $

99 restarts performed without any issue.

>>> VERIFIED