Bug 596210
Summary: | condor daemons SIGSEGV with QMF plugins after restart | | |
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Tomas Rusnak <trusnak> |
Component: | condor | Assignee: | Pete MacKinnon <pmackinn> |
Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
Severity: | high | Priority: | high |
Version: | Development | Target Milestone: | 1.3 |
Hardware: | All | OS: | Linux |
CC: | iboverma, jneedle, matt, pmackinn, rrati | Doc Type: | Bug Fix |
Last Closed: | 2010-10-20 11:33:40 UTC | Bug Depends On: | 534073, 578137, 606761, 610773, 625450 |
Description
Tomas Rusnak
2010-05-26 12:16:03 UTC
Created attachment 416794 [details]
condor configuration
This is only the additional configuration on top of the standard one provided by a clean condor install.
For a simple test, running the reproducer script isn't required. With the provided configuration, the daemons also segfault after a simple restart.

Trace for condor_schedd:
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF RTMIN RT_1], NULL, 8) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
rt_sigprocmask(SIG_BLOCK, ~[ILL TRAP ABRT BUS FPE SEGV RTMIN RT_1], ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF RTMIN RT_1], 8) = 0
umask(022) = 022
open("/var/log/condor/SchedLog", O_WRONLY|O_CREAT|O_EXCL|O_APPEND, 0644) = -1 EEXIST (File exists)
open("/var/log/condor/SchedLog", O_WRONLY|O_APPEND) = 11
fcntl(11, F_GETFL) = 0x8401 (flags O_WRONLY|O_APPEND|O_LARGEFILE)
fstat(11, {st_mode=S_IFREG|0644, st_size=6541, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaab39f000
lseek(11, 0, SEEK_CUR) = 0
lseek(11, 0, SEEK_END) = 6541
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
getpid() = 30198
write(11, "05/26 11:34:44 (pid:30198) MgmtS"..., 62) = 62
close(11) = 0
munmap(0x2aaaab39f000, 4096) = 0
umask(022) = 022
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF RTMIN RT_1], NULL, 8) = 0
clock_gettime(CLOCK_REALTIME, {1274888084, 490670000}) = 0
futex(0x1dea759c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x1dea7570, 4) = 1
futex(0x1dea7570, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1dea759c, FUTEX_WAIT_PRIVATE, 5, NULL) = 0
futex(0x1dea7570, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e0cce78, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x40e3e9d0, FUTEX_WAIT, 30209, NULL) = 0
futex(0x4262d9d0, FUTEX_WAIT, 30210, NULL) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
geteuid() = 64
getegid() = 64
open("/var/log/condor/SchedLog", O_WRONLY|O_CREAT|O_EXCL|O_APPEND, 0644) = -1 EEXIST (File exists)
open("/var/log/condor/SchedLog", O_WRONLY|O_APPEND) = 11
futex(0x3f11954fc0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
getpid() = 30198
write(11, "S", 1) = 1
write(11, "t", 1) = 1
write(11, "a", 1) = 1
write(11, "c", 1) = 1
write(11, "k", 1) = 1
write(11, " ", 1) = 1
write(11, "d", 1) = 1
write(11, "u", 1) = 1
write(11, "m", 1) = 1
write(11, "p", 1) = 1
.... ....
write(11, "8", 1) = 1
write(11, " ", 1) = 1
write(11, "a", 1) = 1
write(11, "t", 1) = 1
write(11, " ", 1) = 1
write(11, "t", 1) = 1
write(11, "i", 1) = 1
write(11, "m", 1) = 1
write(11, "e", 1) = 1
write(11, "s", 1) = 1
write(11, "t", 1) = 1
write(11, "a", 1) = 1
write(11, "m", 1) = 1
write(11, "p", 1) = 1
write(11, " ", 1) = 1
write(11, "1", 1) = 1
write(11, "2", 1) = 1
write(11, "7", 1) = 1
write(11, "4", 1) = 1
write(11, "8", 1) = 1
write(11, "8", 1) = 1
write(11, "8", 1) = 1
write(11, "0", 1) = 1
write(11, "8", 1) = 1
write(11, "4", 1) = 1
write(11, " ", 1) = 1
write(11, "(", 1) = 1
write(11, "4", 1) = 1
write(11, " ", 1) = 1
write(11, "f", 1) = 1
write(11, "r", 1) = 1
write(11, "a", 1) = 1
write(11, "m", 1) = 1
write(11, "e", 1) = 1
write(11, "s", 1) = 1
write(11, ")", 1) = 1
write(11, "\n", 1) = 1
writev(11, [{"condor_schedd", 13}, {"(", 1}, {"dprintf_dump_stack", 18}, {"+0x", 3}, {"4e", 2}, {")", 1}, {"[0x", 3}, {"54844e", 6}, {"]\n", 2}], 9) = 49
writev(11, [{"condor_schedd", 13}, {"[0x", 3}, {"54a222", 6}, {"]\n", 2}], 4) = 24
writev(11, [{"/lib64/libpthread.so.0", 22}, {"[0x", 3}, {"3f11e0eb10", 10}, {"]\n", 2}], 4) = 37
writev(11, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"3f11952c38", 10}, {"]\n", 2}], 4) = 31
close(11) = 0
rt_sigaction(SIGSEGV, {SIG_DFL, [], SA_RESTORER, 0x3f11e0eb10}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
tgkill(1, -1, SIGSEGV) = -1 EINVAL (Invalid argument)
rt_sigreturn(0x1) = 270877928520
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 30198 detached

Marked as Testblocker because this really slows us down. Nothing else can be tested until the condor shutdown crash is resolved.
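In the trace above, the crash handler writes the stack dump header to SchedLog one byte per write() call, which makes the message hard to read. As a reading aid, the payloads can be joined back together. This is a minimal sketch: the function name `reassemble_writes` is hypothetical, and it only handles complete (non-truncated) write() calls and a handful of escape sequences.

```python
import re

# Matches complete strace write() calls such as: write(11, "S", 1) = 1
# Truncated payloads (strace prints them as "..."...) and writev() are skipped.
WRITE_RE = re.compile(r'write\((\d+), "((?:[^"\\]|\\.)*)", \d+\)\s*=\s*\d+')

# Only the escapes needed for this kind of trace; others are left as written.
ESCAPES = {"\\n": "\n", "\\t": "\t", '\\"': '"', "\\\\": "\\"}


def reassemble_writes(trace, fd):
    """Concatenate the payloads of all complete write() calls on one descriptor."""
    parts = []
    for match in WRITE_RE.finditer(trace):
        if int(match.group(1)) != fd:
            continue
        payload = match.group(2)
        parts.append(
            re.sub(r"\\.", lambda m: ESCAPES.get(m.group(0), m.group(0)), payload)
        )
    return "".join(parts)


if __name__ == "__main__":
    sample = "\n".join(
        'write(11, "%s", 1) = 1' % ch for ch in ["S", "t", "a", "c", "k"]
    )
    print(reassemble_writes(sample, 11))
```

Run against the attached strace.log with descriptor 11, this would give back the dump header that the per-character writes spell out.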
FH sha 22c641b. Root cause: overzealous deletion of QMF objects.

*** Bug 595747 has been marked as a duplicate of this bug. ***

According to the changelog, this appears to be fixed in the current version:
* Thu May 27 2010 <matt@redhat> - 7.4.3-0.16
- Updated QMF package to 22c641b4: BZs - 596210

Reproduced on RHEL5/i386:
05/31 05:17:05 ** condor_master (CONDOR_MASTER) STARTING UP
05/31 05:17:05 ** /usr/sbin/condor_master
05/31 05:17:05 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
05/31 05:17:05 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
05/31 05:17:05 ** $CondorVersion: 7.4.3 May 27 2010 BuildID: RH-7.4.3-0.16.el5 PRE-RELEASE $
05/31 05:17:05 ** $CondorPlatform: I386-LINUX_RHEL5 $
05/31 05:17:05 ** PID = 14386
05/31 05:17:05 ** Log last touched 5/31 05:17:03
05/31 05:17:05 ******************************************************
05/31 05:17:05 Using config source: /etc/condor/condor_config
05/31 05:17:05 Using local config sources:
05/31 05:17:05 /var/lib/condor/condor_config.local
05/31 05:17:05 /var/lib/condor/config/99configd.config
05/31 05:17:05 /var/lib/condor/config/condor_config.test
05/31 05:17:05 /var/lib/condor/config/condor_config.test.default
05/31 05:17:05 DaemonCore: Command Socket at <10.16.64.50:47509>
05/31 05:17:05 MasterPlugin registration succeeded
05/31 05:17:05 Successfully loaded plugin: /usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so
05/31 05:17:05 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 14398
05/31 05:17:08 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 14412
05/31 05:17:08 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 14413
05/31 05:17:08 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 14414
05/31 05:17:08 Started process "/usr/sbin/condor_configd", pid and pgroup = 14415
05/31 05:18:17 Got SIGQUIT. Performing fast shutdown.
05/31 05:18:17 Sent SIGQUIT to COLLECTOR (pid 14398)
05/31 05:18:17 Sent SIGQUIT to NEGOTIATOR (pid 14412)
05/31 05:18:17 Sent SIGQUIT to QMF_CONFIGD (pid 14415)
05/31 05:18:17 Sent SIGQUIT to SCHEDD (pid 14413)
05/31 05:18:17 Sent SIGQUIT to STARTD (pid 14414)
05/31 05:18:17 The STARTD (pid 14414) exited with status 0
05/31 05:18:17 The QMF_CONFIGD (pid 14415) exited with status 0
05/31 05:18:17 The SCHEDD (pid 14413) died due to signal 11 (Segmentation fault)
05/31 05:18:18 The NEGOTIATOR (pid 14412) exited with status 0
05/31 05:18:18 The COLLECTOR (pid 14398) exited with status 0
05/31 05:18:18 All daemons are gone. Exiting.
05/31 05:18:19 **** condor_master (condor_MASTER) pid 14386 EXITING WITH STATUS 0

SCHEDD still crashes with signal 11 (Segmentation fault). The other daemons seem to be fixed.

Backtrace from SchedLog please. I haven't been able to reproduce this locally or on mrg27 yet.

Were there any condor_submits done against this schedd before the crash?

There was no running job in my test. It only toggles restarts of condor and/or qpid at random times.
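The MasterLog excerpt above reports the crash in a single "died due to signal" line among many normal exit lines. When retesting across many restarts, a small filter for those lines can save some scrolling. The sketch below is only illustrative: the function name `find_signal_deaths` is hypothetical, and the pattern is derived solely from the log lines quoted in this report.

```python
import re

# Pattern taken from MasterLog lines such as:
#   05/31 05:18:17 The SCHEDD (pid 14413) died due to signal 11 (Segmentation fault)
DEATH_RE = re.compile(
    r"The (?P<daemon>\w+) \(pid (?P<pid>\d+)\) died due to signal (?P<signal>\d+)"
)


def find_signal_deaths(log_text):
    """Return (daemon, pid, signal) for every daemon the master saw die on a signal."""
    return [
        (m.group("daemon"), int(m.group("pid")), int(m.group("signal")))
        for m in DEATH_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    excerpt = (
        "05/31 05:18:17 The STARTD (pid 14414) exited with status 0\n"
        "05/31 05:18:17 The SCHEDD (pid 14413) died due to signal 11 (Segmentation fault)\n"
    )
    for daemon, pid, signal_number in find_signal_deaths(excerpt):
        print(daemon, pid, signal_number)
```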
Today I updated the qmf packages and tested again:
qpid-java-client-0.7.946106-3.el5
qpid-cpp-client-devel-docs-0.7.946106-1.el5
condor-wallaby-client-2.7-0.4.el5
condor-debuginfo-7.4.3-0.16.el5
qmf-0.7.946106-2.el5
qmf-devel-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-java-common-0.7.946106-3.el5
python-condorutils-1.4-0.3.el5
condor-wallaby-tools-2.7-0.4.el5
qpid-tests-0.7.946106-1.el5
condor-qmf-7.4.3-0.16.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
python-qpid-0.7.946106-1.el5
python-qmf-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
qpid-cpp-server-store-0.7.946106-1.el5
condor-7.4.3-0.16.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5

MasterLog:
06/01 11:01:31 Reading from /proc/cpuinfo
06/01 11:01:31 Found: Physical-IDs:False; Core-IDs:False
06/01 11:01:31 Using processor count: 2 processors, 2 CPUs, 0 HTs
06/01 11:01:31 Reading condor configuration from '/etc/condor/condor_config'
06/01 11:01:31 ******************************************************
06/01 11:01:31 ** condor_master (CONDOR_MASTER) STARTING UP
06/01 11:01:31 ** /usr/sbin/condor_master
06/01 11:01:31 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
06/01 11:01:31 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
06/01 11:01:31 ** $CondorVersion: 7.4.3 May 27 2010 BuildID: RH-7.4.3-0.16.el5 PRE-RELEASE $
06/01 11:01:31 ** $CondorPlatform: I386-LINUX_RHEL5 $
06/01 11:01:31 ** PID = 1220
06/01 11:01:31 ** Log last touched 6/1 08:06:40
06/01 11:01:31 ******************************************************
06/01 11:01:31 Using config source: /etc/condor/condor_config
06/01 11:01:31 Using local config sources:
06/01 11:01:31 /var/lib/condor/condor_config.local
06/01 11:01:31 /var/lib/condor/config/99configd.config
06/01 11:01:31 /var/lib/condor/config/condor_config.test
06/01 11:01:31 /var/lib/condor/config/condor_config.test.default
06/01 11:01:31 Attempting to lock /var/lock/condor/InstanceLock.
06/01 11:01:31 FileLock object is updating timestamp on: /var/lock/condor/InstanceLock
06/01 11:01:31 FileLock::obtain(1) - @1275404491.086574 lock on /var/lock/condor/InstanceLock now WRITE
06/01 11:01:31 Obtained lock on /var/lock/condor/InstanceLock.
06/01 11:01:31 DaemonCore: Command Socket at <10.16.64.50:54359>
06/01 11:01:31 Will use UDP to update collector cpq-dl380-01.rhts.eng.bos.redhat.com <10.16.64.50:9618>
06/01 11:01:31 Checking for PLUGINS config option
06/01 11:01:31 MasterPlugin registration succeeded
06/01 11:01:31 Successfully loaded plugin: /usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so
06/01 11:01:31 MasterPluginManager::Initialize
06/01 11:01:31 MasterPlugin initializing 1 plugins
06/01 11:01:31 MgmtMasterPlugin: Initializing...
Stack dump for process 1220 at timestamp 1275404491 (11 frames)
condor_master(dprintf_dump_stack+0x44)[0x80c8d34]
condor_master[0x80caa94]
[0x8e4420]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl4initERKNS0_18ConnectionSettingsEtbRKSs+0x4c)[0x27395c]
/usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so(_ZN3qmf3com6redhat4grid6Master12registerSelfEPN4qpid10management15ManagementAgentE+0x38)[0x13c548]
/usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so(_ZN16MgmtMasterPlugin10initializeEv+0x75)[0x144a25]
condor_master(_ZN19MasterPluginManager10InitializeEv+0x6c)[0x80a552c]
condor_master(_Z9main_initiPPc+0x194)[0x809f604]
condor_master(main+0xd73)[0x80c29b3]
/lib/libc.so.6(__libc_start_main+0xdc)[0x2fce9c]
condor_master[0x809dc11]

SchedLog was not touched by any process; there is no change in it. The crash happens before the schedd starts. Is the SchedLog from the previous version of qmf still needed?
Believed to be due to a build mismatch between qpid-cpp-client-0.7.946106-2.el5 and condor-qmf-7.4.3-0.16.el5.

Retested on:
qpid-cpp-server-0.7.946106-2.el5
qpid-java-common-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
condor-qmf-7.4.3-0.18.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
python-qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-2.el5
qpid-tests-0.7.946106-1.el5
python-condorutils-1.4-0.4.el5
qmf-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
python-qpid-0.7.946106-1.el5
qpid-java-client-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
rh-tests-distribution-MRG-Grid-grid_test_segfault_condor_with_qmf_bz534073-1.0-2
condor-debuginfo-7.4.3-0.18.el5
condor-wallaby-tools-2.7-0.5.el5
condor-wallaby-client-2.7-0.5.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
condor-7.4.3-0.18.el5

MasterLog:
06/04 09:49:18 NumberOfChildren() returning 2
06/04 09:49:19 DaemonCore: No more children processes to reap.
06/04 09:49:19 The SCHEDD (pid 11926) died due to signal 11 (Segmentation fault)
06/04 09:49:19 ProcAPI::buildFamily failed: parent 11926 not found on system.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 NumberOfChildren() returning 1
06/04 09:49:26 DaemonCore: No more children processes to reap.
06/04 09:49:26 The STARTD (pid 11927) exited with status 0
06/04 09:49:26 ProcAPI::buildFamily failed: parent 11927 not found on system.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 NumberOfChildren() returning 0
06/04 09:49:26 All daemons are gone. Exiting.
06/04 09:49:26 MasterPluginManager::Shutdown
06/04 09:49:26 MgmtMasterPlugin: shutting down...
06/04 09:49:26 **** condor_master (condor_MASTER) pid 11909 EXITING WITH STATUS 0

SchedLog:
06/04 09:49:18 (pid:11926) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.16.66.144:9618>.
06/04 09:49:18 (pid:11926) IO: Failed to read packet header
06/04 09:49:18 (pid:11926) Failed to read ClassAd size.
06/04 09:49:18 (pid:11926) SECMAN: no classad from server, failing
06/04 09:49:18 (pid:11926) ERROR: SECMAN:2004:Failed to create security session to <10.16.66.144:9618> with TCP.|SECMAN:2007:Failed to end classad message.
06/04 09:49:18 (pid:11926) Canceled/Closed 5 socket(s) at shutdown
06/04 09:49:18 (pid:11926) MgmtScheddPlugin: shutting down...
06/04 09:49:19 (pid:11926) All shadows have been killed, exiting.
Stack dump for process 11926 at timestamp 1275659359 (8 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d4a4]
condor_schedd[0x817f204]
[0x774420]
[0x774420]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x1035796]
/usr/lib/libqpidcommon.so.2[0x102bb31]
/lib/libpthread.so.0[0x6ac832]
/lib/libc.so.6(clone+0x5e)[0x601e0e]
06/04 09:49:19 (pid:11926) **** condor_schedd (condor_SCHEDD) pid 11926 EXITING WITH STATUS 0

The scheduler is still crashing with a segfault on the latest condor release.

Created attachment 421764 [details]
strace from schedd at the crash
Strace from schedd - tested version: condor-7.4.3-0.18.el5
Created attachment 424077 [details]
strace from schedd at the crash
Retest on latest version of packages:
python-qpid-0.7.946106-1.el5
qpid-java-client-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-2.el5
python-condorutils-1.4-0.6.el5
condor-wallaby-client-2.9-0.1.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
qpid-java-common-0.7.946106-3.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
condor-wallaby-tools-2.9-0.1.el5
condor-7.4.3-0.19.el5
python-qmf-0.7.946106-3.el5
qpid-cpp-client-0.7.946106-2.el5
qmf-devel-0.7.946106-2.el5
qpid-tests-0.7.946106-1.el5
condor-qmf-7.4.3-0.19.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-tools-0.7.946106-4.el5
qmf-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
condor-debuginfo-7.4.3-0.19.el5
06/15 04:41:40 (pid:23088) -------- Begin starting jobs --------
06/15 04:41:40 (pid:23088) -------- Done starting jobs --------
06/15 04:41:53 (pid:23088) Got SIGQUIT. Performing fast shutdown.
06/15 04:41:53 (pid:23088) Now in shutdown_fast. Sending signals to shadows
06/15 04:41:53 (pid:23088) ScheddCronMgr: Shutting down
06/15 04:41:53 (pid:23088) CronMgr: Killing all jobs
06/15 04:41:53 (pid:23088) Trying to update collector <10.16.66.146:9618>
06/15 04:41:53 (pid:23088) Attempting to send update via UDP to collector hp-dl160g6-01.rhts.bos.redhat.com <10.16.66.146:9618>
06/15 04:41:53 (pid:23088) Canceled/Closed 5 socket(s) at shutdown
06/15 04:41:53 (pid:23088) MgmtScheddPlugin: shutting down...
06/15 04:41:53 (pid:23088) All shadows have been killed, exiting.
Stack dump for process 23088 at timestamp 1276591313 (8 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d7d4]
06/15 04:41:53 (pid:23088) **** condor_schedd (condor_SCHEDD) pid 23088 EXITING WITH STATUS 0
condor_schedd[0x817f534]
[0x240420]
[0x240420]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x499796]
/usr/lib/libqpidcommon.so.2[0x48fb31]
/lib/libpthread.so.0[0x94b832]
/lib/libc.so.6(clone+0x5e)[0x8a0e0e]
See attached strace.log from schedd. It's still crashing with SIGSEGV.
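The stack dumps in SchedLog mix log lines with frames and show C++ symbols in mangled form. The sketch below pulls the frames out of such a dump and decodes only the simplest kind of mangled name (a plain nested name like `_ZN4qpid3sys6Poller3runEv`). The helper names `parse_frames` and `qualified_name` are hypothetical, and the decoder is a deliberate simplification that ignores templates, substitutions, qualifiers and argument types; a real demangler such as c++filt is the better tool when it is available.

```python
import re

# A frame looks like module(symbol+0xoff)[0xaddr], module[0xaddr] or just [0xaddr].
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]*)"
    r"(?:\((?P<symbol>[^+)]+)\+(?P<offset>0x[0-9a-fA-F]+)\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(dump):
    """Return one dict per stack frame line; ordinary log lines are ignored."""
    frames = []
    for line in dump.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def qualified_name(mangled):
    """Decode a plain nested name (_ZN<len><id>...E); return None for anything else."""
    if not mangled or not mangled.startswith("_ZN"):
        return None
    pos, parts = 3, []
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    if parts and pos < len(mangled) and mangled[pos] == "E":
        return "::".join(parts)
    return None


if __name__ == "__main__":
    dump = """condor_schedd(dprintf_dump_stack+0x44)[0x817d7d4]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x499796]
/lib/libc.so.6(clone+0x5e)[0x8a0e0e]"""
    for frame in parse_frames(dump):
        name = qualified_name(frame["symbol"]) or frame["symbol"] or "?"
        print(frame["module"] or "?", name, frame["address"])
```

Applied to the dump above, the only named library frame decodes to `qpid::sys::Poller::run`, reached from a thread started through libpthread and `clone`.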
Retest done on current condor-7.4.3-0.20.el5. The results on all platforms are the same as before: the same crash with the same error message.

Built into 7.4.4-0.1. There is also a fix in qpid; either one is sufficient.

Verified over all combinations of RHEL4/RHEL5 and x86/x86_64 with the latest packages:
$CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL4 $
$CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $
$CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $
$CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $
99 restarts performed without any issue.
>>> VERIFIED
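A repeated-restart check like the one described above can be scripted. The following is a rough sketch, not the reproducer used in this report: `service condor restart` and the MasterLog path are assumptions about the test machine, the function names are hypothetical, and log rotation between restarts is not handled.

```python
import re
import subprocess
import time

CRASH_RE = re.compile(r"died due to signal \d+")


def service_restart():
    # Assumption: condor is managed through the SysV "service" wrapper.
    subprocess.run(["service", "condor", "restart"], check=True)


def read_master_log(path="/var/log/condor/MasterLog"):
    # Assumption: MasterLog sits in the same directory as the SchedLog seen in the strace.
    with open(path, errors="replace") as handle:
        return handle.read()


def count_crashes(log_text):
    """Count MasterLog lines reporting a daemon killed by a signal."""
    return len(CRASH_RE.findall(log_text))


def restart_test(restart=service_restart, read_log=read_master_log,
                 attempts=99, settle=0.0):
    """Return the 1-based attempt at which a new crash line appeared, or None."""
    seen = count_crashes(read_log())
    for attempt in range(1, attempts + 1):
        restart()
        time.sleep(settle)  # give the daemons time to come up before the next restart
        if count_crashes(read_log()) > seen:
            return attempt
    return None
```

The restart and log-reading steps are passed in as callables so the loop can be exercised without a condor installation, for example by substituting functions that append to an in-memory log.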