Bug 596210

Summary: condor daemons SIGSEGV with QMF plugins after restart
Product: Red Hat Enterprise MRG Reporter: Tomas Rusnak <trusnak>
Component: condor Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED ERRATA QA Contact: Tomas Rusnak <trusnak>
Severity: high Docs Contact:
Priority: high    
Version: Development CC: iboverma, jneedle, matt, pmackinn, rrati
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-20 11:33:40 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 534073, 578137, 606761, 610773, 625450    
Bug Blocks:    
Attachments:
Description                        Flags
strace from schedd at the crash    none
condor configuration               none
strace from schedd at the crash    none
strace from schedd at the crash    none

Description Tomas Rusnak 2010-05-26 12:16:03 UTC
Created attachment 416789 [details]
strace from schedd at the crash

Description of problem:
SIGSEGVs are observed in all condor daemons (collector, negotiator, schedd) in a setup with QMF plugins. I restarted the qpid and condor services in a loop, at random and in varying order. The stack dump of one of the daemons (schedd) before the crash is at the bottom.

Version-Release number of selected component (if applicable):
condor-wallaby-client-2.7-0.4.el5
classads-1.0.6-1.el5
condor-7.4.3-0.14.el5
classads-1.0.4-1.el5
python-condorutils-1.4-0.3.el5
condor-debuginfo-7.4.3-0.14.el5
condor-wallaby-tools-2.7-0.4.el5
condor-qmf-7.4.3-0.14.el5

How reproducible:
Always

Steps to Reproduce:
1. configure condor with QMF plugins (config in attachment)
2. run the script from https://bugzilla.redhat.com/attachment.cgi?id=405940
3. take a look at logs and/or follow process by strace
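
The restart loop from step 2 can be approximated with a small script. This is only a sketch and not the attached reproducer: the service names `qpidd` and `condor`, the `service <name> restart` invocation, and the sleep bounds are all assumptions.

```python
import random
import subprocess
import time

# Assumed init-script names; adjust to the services installed on the test host.
SERVICES = ["qpidd", "condor"]


def restart_plan(rounds, rng):
    """Return a list of rounds, each a randomly ordered, non-empty subset of SERVICES."""
    plan = []
    for _ in range(rounds):
        count = rng.randint(1, len(SERVICES))
        plan.append(rng.sample(SERVICES, count))
    return plan


def run_plan(plan, run=subprocess.call, sleep=time.sleep, rng=random):
    """Restart the services of each round in order, pausing a random time between rounds."""
    failures = []
    for services in plan:
        for name in services:
            # "service <name> restart" is the assumed way to restart a daemon here.
            status = run(["service", name, "restart"])
            if status != 0:
                failures.append((name, status))
        # Hypothetical pause bounds; the original script's timing is not known.
        sleep(rng.uniform(1.0, 10.0))
    return failures
```

Run as root, `run_plan(restart_plan(99, random.Random()))` would perform 99 rounds of restarts. The `run` and `sleep` parameters exist so the loop can be exercised without touching real services.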
  
Actual results:
Daemons are crashing

Expected results:
Daemons are not crashing

Additional info:

Thread 3 (Thread 0x42a39940 (LWP 32531)):
#0  0x0000003f116ccfc2 in select () from /lib64/libc.so.6
#1  0x0000000000490323 in sleep ()
#2  0x0000003f18a1ba52 in qpid::management::ManagementAgentImpl::PublishThread::run() () from /usr/lib64/libqmf.so.1
#3  0x0000003f16f2358a in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.2
#4  0x0000003f11e0673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f116d3d1d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x4343a940 (LWP 32534)):
#0  0x0000003f116d4108 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003f16f2bfef in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.2
#2  0x0000003f16f2c9d2 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.2
#3  0x0000003f16f2358a in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.2
#4  0x0000003f11e0673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f116d3d1d in clone () from /lib64/libc.so.6
Thread 1 (LWP 32509):
#0  0x0000003f11e07b35 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000003f16f23add in qpid::sys::Thread::join() () from /usr/lib64/libqpidcommon.so.2
#2  0x0000003f18a1f883 in qpid::management::ManagementAgentImpl::~ManagementAgentImpl() () from /usr/lib64/libqmf.so.1
#3  0x0000003f18a11e2b in qpid::management::ManagementAgent::Singleton::~Singleton() () from /usr/lib64/libqmf.so.1
#4  0x00002ba5755b18aa in MgmtMasterPlugin::shutdown() () from /usr/lib64/condor/plugins/MgmtMasterPlugin-plugin.so
#5  0x0000000000469141 in MasterPluginManager::Shutdown() ()
#6  0x000000000046330d in master_exit(int) ()
#7  0x0000000000467935 in Daemons::AllReaper(int, int) ()
#8  0x000000000046baac in DaemonCore::CallReaper(int, char const*, int, int) ()
#9  0x000000000047e507 in DaemonCore::HandleProcessExit(int, int) ()
#10 0x000000000047e64e in DaemonCore::HandleDC_SERVICEWAITPIDS(int) ()
#11 0x0000000000471d57 in DaemonCore::Driver() ()
#12 0x0000000000485548 in main ()

Comment 1 Tomas Rusnak 2010-05-26 12:35:37 UTC
Created attachment 416794 [details]
condor configuration

This is only the additional configuration applied on top of the standard one provided by a clean condor install.

Comment 2 Tomas Rusnak 2010-05-26 15:40:07 UTC
For a simple test, running the reproducer script is not required. With the provided configuration, the daemons also segfault after a simple restart.

Trace for condor_schedd:

rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF RTMIN RT_1], NULL, 8) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
rt_sigprocmask(SIG_BLOCK, ~[ILL TRAP ABRT BUS FPE SEGV RTMIN RT_1], ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF RTMIN RT_1], 8) = 0
umask(022)                              = 022
open("/var/log/condor/SchedLog", O_WRONLY|O_CREAT|O_EXCL|O_APPEND, 0644) = -1 EEXIST (File exists)
open("/var/log/condor/SchedLog", O_WRONLY|O_APPEND) = 11
fcntl(11, F_GETFL)                      = 0x8401 (flags O_WRONLY|O_APPEND|O_LARGEFILE)
fstat(11, {st_mode=S_IFREG|0644, st_size=6541, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaab39f000
lseek(11, 0, SEEK_CUR)                  = 0
lseek(11, 0, SEEK_END)                  = 6541
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
getpid()                                = 30198
write(11, "05/26 11:34:44 (pid:30198) MgmtS"..., 62) = 62
close(11)                               = 0
munmap(0x2aaaab39f000, 4096)            = 0
umask(022)                              = 022
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF RTMIN RT_1], NULL, 8) = 0
clock_gettime(CLOCK_REALTIME, {1274888084, 490670000}) = 0
futex(0x1dea759c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x1dea7570, 4) = 1
futex(0x1dea7570, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1dea759c, FUTEX_WAIT_PRIVATE, 5, NULL) = 0
futex(0x1dea7570, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e0cce78, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x40e3e9d0, FUTEX_WAIT, 30209, NULL) = 0
futex(0x4262d9d0, FUTEX_WAIT, 30210, NULL) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
geteuid()                               = 64
getegid()                               = 64
open("/var/log/condor/SchedLog", O_WRONLY|O_CREAT|O_EXCL|O_APPEND, 0644) = -1 EEXIST (File exists)
open("/var/log/condor/SchedLog", O_WRONLY|O_APPEND) = 11
futex(0x3f11954fc0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
getpid()                                = 30198
write(11, "S", 1)                       = 1
write(11, "t", 1)                       = 1
write(11, "a", 1)                       = 1
write(11, "c", 1)                       = 1
write(11, "k", 1)                       = 1
write(11, " ", 1)                       = 1
write(11, "d", 1)                       = 1
write(11, "u", 1)                       = 1
write(11, "m", 1)                       = 1
write(11, "p", 1)                       = 1
....
....
write(11, "8", 1)                       = 1
write(11, " ", 1)                       = 1
write(11, "a", 1)                       = 1
write(11, "t", 1)                       = 1
write(11, " ", 1)                       = 1
write(11, "t", 1)                       = 1
write(11, "i", 1)                       = 1
write(11, "m", 1)                       = 1
write(11, "e", 1)                       = 1
write(11, "s", 1)                       = 1
write(11, "t", 1)                       = 1
write(11, "a", 1)                       = 1
write(11, "m", 1)                       = 1
write(11, "p", 1)                       = 1
write(11, " ", 1)                       = 1
write(11, "1", 1)                       = 1
write(11, "2", 1)                       = 1
write(11, "7", 1)                       = 1
write(11, "4", 1)                       = 1
write(11, "8", 1)                       = 1
write(11, "8", 1)                       = 1
write(11, "8", 1)                       = 1
write(11, "0", 1)                       = 1
write(11, "8", 1)                       = 1
write(11, "4", 1)                       = 1
write(11, " ", 1)                       = 1
write(11, "(", 1)                       = 1
write(11, "4", 1)                       = 1
write(11, " ", 1)                       = 1
write(11, "f", 1)                       = 1
write(11, "r", 1)                       = 1
write(11, "a", 1)                       = 1
write(11, "m", 1)                       = 1
write(11, "e", 1)                       = 1
write(11, "s", 1)                       = 1
write(11, ")", 1)                       = 1
write(11, "\n", 1)                      = 1
writev(11, [{"condor_schedd", 13}, {"(", 1}, {"dprintf_dump_stack", 18}, {"+0x", 3}, {"4e", 2}, {")", 1}, {"[0x", 3}, {"54844e", 6}, {"]\n", 2}], 9) = 49
writev(11, [{"condor_schedd", 13}, {"[0x", 3}, {"54a222", 6}, {"]\n", 2}], 4) = 24
writev(11, [{"/lib64/libpthread.so.0", 22}, {"[0x", 3}, {"3f11e0eb10", 10}, {"]\n", 2}], 4) = 37
writev(11, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"3f11952c38", 10}, {"]\n", 2}], 4) = 31
close(11)                               = 0
rt_sigaction(SIGSEGV, {SIG_DFL, [], SA_RESTORER, 0x3f11e0eb10}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
tgkill(1, -1, SIGSEGV)                  = -1 EINVAL (Invalid argument)
rt_sigreturn(0x1)                       = 270877928520
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 30198 detached
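
The signal handler in the trace above writes its "Stack dump" header one byte per `write()` call, which makes the message hard to read. A minimal sketch for reassembling such output from strace lines follows. It assumes the strace line format shown above, handles only `write()` (not `writev()`), and the function name `reassemble_writes` is hypothetical.

```python
import ast
import re

# Matches strace lines such as: write(11, "S", 1) = 1
# An optional "..." after the closing quote marks a string that strace truncated.
WRITE_RE = re.compile(
    r'^write\((\d+), ("(?:[^"\\]|\\.)*")(\.\.\.)?, \d+\)\s+= (-?\d+)'
)


def reassemble_writes(lines, fd):
    """Concatenate the payloads of successful write() calls on one file descriptor."""
    parts = []
    for line in lines:
        match = WRITE_RE.match(line)
        if match is None or int(match.group(1)) != fd:
            continue
        if int(match.group(4)) < 0:
            continue  # the write failed, so nothing reached the file
        # strace prints C-style escapes; evaluating the quoted text as a
        # Python string literal decodes the common ones (\n, \t, octal).
        text = ast.literal_eval(match.group(2))
        if match.group(3):
            text += "[...]"  # keep strace's truncation visible
        parts.append(text)
    return "".join(parts)
```

Feeding the lines of the saved strace output to `reassemble_writes(lines, 11)` would return the text written to SchedLog through descriptor 11, with truncated strings marked by `[...]`.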

Comment 3 Tomas Rusnak 2010-05-26 15:44:52 UTC
Marked as Testblocker because this really slows us down. Nothing else can be tested until condor shuts down correctly.

Comment 4 Pete MacKinnon 2010-05-27 20:13:07 UTC
FH sha 22c641b

Root cause: overzealous deletion of QMF objects.

Comment 5 Pete MacKinnon 2010-05-28 15:16:36 UTC
*** Bug 595747 has been marked as a duplicate of this bug. ***

Comment 6 Tomas Rusnak 2010-05-31 09:26:03 UTC
According to the changelog, this should be fixed in the current version:
* Thu May 27 2010 <matt@redhat> - 7.4.3-0.16 - Updated QMF package to 22c641b4: BZs - 596210 

Reproduced on RHEL5/i386:

05/31 05:17:05 ** condor_master (CONDOR_MASTER) STARTING UP
05/31 05:17:05 ** /usr/sbin/condor_master
05/31 05:17:05 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
05/31 05:17:05 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
05/31 05:17:05 ** $CondorVersion: 7.4.3 May 27 2010 BuildID: RH-7.4.3-0.16.el5 PRE-RELEASE $
05/31 05:17:05 ** $CondorPlatform: I386-LINUX_RHEL5 $
05/31 05:17:05 ** PID = 14386
05/31 05:17:05 ** Log last touched 5/31 05:17:03
05/31 05:17:05 ******************************************************
05/31 05:17:05 Using config source: /etc/condor/condor_config
05/31 05:17:05 Using local config sources: 
05/31 05:17:05    /var/lib/condor/condor_config.local
05/31 05:17:05    /var/lib/condor/config/99configd.config
05/31 05:17:05    /var/lib/condor/config/condor_config.test
05/31 05:17:05    /var/lib/condor/config/condor_config.test.default
05/31 05:17:05 DaemonCore: Command Socket at <10.16.64.50:47509>
05/31 05:17:05 MasterPlugin registration succeeded
05/31 05:17:05 Successfully loaded plugin: /usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so
05/31 05:17:05 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 14398
05/31 05:17:08 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 14412
05/31 05:17:08 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 14413
05/31 05:17:08 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 14414
05/31 05:17:08 Started process "/usr/sbin/condor_configd", pid and pgroup = 14415
05/31 05:18:17 Got SIGQUIT.  Performing fast shutdown.
05/31 05:18:17 Sent SIGQUIT to COLLECTOR (pid 14398)
05/31 05:18:17 Sent SIGQUIT to NEGOTIATOR (pid 14412)
05/31 05:18:17 Sent SIGQUIT to QMF_CONFIGD (pid 14415)
05/31 05:18:17 Sent SIGQUIT to SCHEDD (pid 14413)
05/31 05:18:17 Sent SIGQUIT to STARTD (pid 14414)
05/31 05:18:17 The STARTD (pid 14414) exited with status 0
05/31 05:18:17 The QMF_CONFIGD (pid 14415) exited with status 0
05/31 05:18:17 The SCHEDD (pid 14413) died due to signal 11 (Segmentation fault)
05/31 05:18:18 The NEGOTIATOR (pid 14412) exited with status 0
05/31 05:18:18 The COLLECTOR (pid 14398) exited with status 0
05/31 05:18:18 All daemons are gone.  Exiting.
05/31 05:18:19 **** condor_master (condor_MASTER) pid 14386 EXITING WITH STATUS 0


SCHEDD still crashes with signal 11 (Segmentation fault). The other daemons seem to be fixed.
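
When the restart loop runs many times, scanning MasterLog by eye gets tedious. The following sketch pulls out which daemons exited and which were killed by a signal. It assumes the "The <DAEMON> (pid N) ..." line format seen in the log above; the function names are hypothetical.

```python
import re

# Matches MasterLog lines such as:
#   05/31 05:18:17 The SCHEDD (pid 14413) died due to signal 11 (Segmentation fault)
#   05/31 05:18:17 The STARTD (pid 14414) exited with status 0
EXIT_RE = re.compile(
    r"The (?P<daemon>\w+) \(pid (?P<pid>\d+)\) "
    r"(?:died due to signal (?P<signal>\d+)|exited with status (?P<status>\d+))"
)


def daemon_exits(lines):
    """Return (daemon, pid, signal, status) per exit line; the unused field is None."""
    exits = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match is None:
            continue
        signal = match.group("signal")
        status = match.group("status")
        exits.append((
            match.group("daemon"),
            int(match.group("pid")),
            int(signal) if signal is not None else None,
            int(status) if status is not None else None,
        ))
    return exits


def crashed_daemons(lines):
    """Return (daemon, pid, signal) for every daemon that died due to a signal."""
    return [
        (daemon, pid, signal)
        for daemon, pid, signal, _ in daemon_exits(lines)
        if signal is not None
    ]
```

Applied to the log excerpt above, `crashed_daemons` would report only the SCHEDD entry with signal 11.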

Comment 7 Pete MacKinnon 2010-06-01 12:35:55 UTC
Backtrace from SchedLog please. I haven't been able to reproduce this locally or on mrg27 yet. Were there any condor_submits done against this schedd before the crash?

Comment 8 Tomas Rusnak 2010-06-01 15:12:57 UTC
There were no running jobs in my test. It only restarts condor and/or qpid at random times.

Today I updated the qmf packages and tested again:

qpid-java-client-0.7.946106-3.el5
qpid-cpp-client-devel-docs-0.7.946106-1.el5
condor-wallaby-client-2.7-0.4.el5
condor-debuginfo-7.4.3-0.16.el5
qmf-0.7.946106-2.el5
qmf-devel-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-java-common-0.7.946106-3.el5
python-condorutils-1.4-0.3.el5
condor-wallaby-tools-2.7-0.4.el5
qpid-tests-0.7.946106-1.el5
condor-qmf-7.4.3-0.16.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
python-qpid-0.7.946106-1.el5
python-qmf-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
qpid-cpp-server-store-0.7.946106-1.el5
condor-7.4.3-0.16.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5


MasterLog:

06/01 11:01:31 Reading from /proc/cpuinfo
06/01 11:01:31 Found: Physical-IDs:False; Core-IDs:False
06/01 11:01:31 Using processor count: 2 processors, 2 CPUs, 0 HTs
06/01 11:01:31 Reading condor configuration from '/etc/condor/condor_config'
06/01 11:01:31 ******************************************************
06/01 11:01:31 ** condor_master (CONDOR_MASTER) STARTING UP
06/01 11:01:31 ** /usr/sbin/condor_master
06/01 11:01:31 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
06/01 11:01:31 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
06/01 11:01:31 ** $CondorVersion: 7.4.3 May 27 2010 BuildID: RH-7.4.3-0.16.el5 PRE-RELEASE $
06/01 11:01:31 ** $CondorPlatform: I386-LINUX_RHEL5 $
06/01 11:01:31 ** PID = 1220
06/01 11:01:31 ** Log last touched 6/1 08:06:40
06/01 11:01:31 ******************************************************
06/01 11:01:31 Using config source: /etc/condor/condor_config
06/01 11:01:31 Using local config sources:
06/01 11:01:31    /var/lib/condor/condor_config.local
06/01 11:01:31    /var/lib/condor/config/99configd.config
06/01 11:01:31    /var/lib/condor/config/condor_config.test
06/01 11:01:31    /var/lib/condor/config/condor_config.test.default
06/01 11:01:31 Attempting to lock /var/lock/condor/InstanceLock.
06/01 11:01:31 FileLock object is updating timestamp on: /var/lock/condor/InstanceLock
06/01 11:01:31 FileLock::obtain(1) - @1275404491.086574 lock on /var/lock/condor/InstanceLock now WRITE
06/01 11:01:31 Obtained lock on /var/lock/condor/InstanceLock.
06/01 11:01:31 DaemonCore: Command Socket at <10.16.64.50:54359>
06/01 11:01:31 Will use UDP to update collector cpq-dl380-01.rhts.eng.bos.redhat.com <10.16.64.50:9618>
06/01 11:01:31 Checking for PLUGINS config option
06/01 11:01:31 MasterPlugin registration succeeded
06/01 11:01:31 Successfully loaded plugin: /usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so
06/01 11:01:31 MasterPluginManager::Initialize
06/01 11:01:31 MasterPlugin initializing 1 plugins
06/01 11:01:31 MgmtMasterPlugin: Initializing...
Stack dump for process 1220 at timestamp 1275404491 (11 frames)
condor_master(dprintf_dump_stack+0x44)[0x80c8d34]
condor_master[0x80caa94]
[0x8e4420]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl4initERKNS0_18ConnectionSettingsEtbRKSs+0x4c)[0x27395c]
/usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so(_ZN3qmf3com6redhat4grid6Master12registerSelfEPN4qpid10management15ManagementAgentE+0x38)[0x13c548]
/usr/lib/condor/plugins/MgmtMasterPlugin-plugin.so(_ZN16MgmtMasterPlugin10initializeEv+0x75)[0x144a25]
condor_master(_ZN19MasterPluginManager10InitializeEv+0x6c)[0x80a552c]
condor_master(_Z9main_initiPPc+0x194)[0x809f604]
condor_master(main+0xd73)[0x80c29b3]
/lib/libc.so.6(__libc_start_main+0xdc)[0x2fce9c]
condor_master[0x809dc11]

SchedLog was not touched by any process, so there is no change in it. The crash happens before the schedd starts.

Is the SchedLog from the previous version of qmf still needed?
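
The stack dumps in these logs contain mangled C++ symbol names, which are easier to read once separated from the module and address. A minimal sketch for splitting the frames follows. It assumes the `module(symbol+offset)[address]` layout seen above, and the helper names are hypothetical.

```python
import re

# One frame per line, in one of these shapes:
#   condor_master(dprintf_dump_stack+0x44)[0x80c8d34]
#   condor_master[0x80caa94]
#   [0x8e4420]
FRAME_RE = re.compile(
    r"^(?P<module>[^\[(]*)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Split one stack dump line into module, symbol, offset and address, or return None."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    frame = match.groupdict()
    # Normalise empty strings (frame without module or symbol) to None.
    frame["module"] = frame["module"] or None
    frame["symbol"] = frame["symbol"] or None
    return frame


def mangled_symbols(lines):
    """Return the mangled C++ names (Itanium ABI names start with _Z) found in the frames."""
    names = []
    for line in lines:
        frame = parse_frame(line)
        if frame and frame["symbol"] and frame["symbol"].startswith("_Z"):
            names.append(frame["symbol"])
    return names
```

The names returned by `mangled_symbols` can then be passed to a demangler such as `c++filt`. Ordinary log lines that are interleaved with the dump do not match the pattern and are skipped.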

Comment 9 Pete MacKinnon 2010-06-02 13:07:08 UTC
Believed to be due to a build mismatch between qpid-cpp-client-0.7.946106-2.el5 and condor-qmf-7.4.3-0.16.el5

Comment 10 Tomas Rusnak 2010-06-04 14:07:48 UTC
Retested on:

qpid-cpp-server-0.7.946106-2.el5
qpid-java-common-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
condor-qmf-7.4.3-0.18.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
python-qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-2.el5
qpid-tests-0.7.946106-1.el5
python-condorutils-1.4-0.4.el5
qmf-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
python-qpid-0.7.946106-1.el5
qpid-java-client-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
rh-tests-distribution-MRG-Grid-grid_test_segfault_condor_with_qmf_bz534073-1.0-2
condor-debuginfo-7.4.3-0.18.el5
condor-wallaby-tools-2.7-0.5.el5
condor-wallaby-client-2.7-0.5.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
condor-7.4.3-0.18.el5

MasterLog:

06/04 09:49:18 NumberOfChildren() returning 2
06/04 09:49:19 DaemonCore: No more children processes to reap.
06/04 09:49:19 The SCHEDD (pid 11926) died due to signal 11 (Segmentation fault)
06/04 09:49:19 ProcAPI::buildFamily failed: parent 11926 not found on system.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11926 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 ProcAPI::getProcInfo() pid 11929 does not exist.
06/04 09:49:19 NumberOfChildren() returning 1
06/04 09:49:26 DaemonCore: No more children processes to reap.
06/04 09:49:26 The STARTD (pid 11927) exited with status 0
06/04 09:49:26 ProcAPI::buildFamily failed: parent 11927 not found on system.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11927 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11932 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 ProcAPI::getProcInfo() pid 11936 does not exist.
06/04 09:49:26 NumberOfChildren() returning 0
06/04 09:49:26 All daemons are gone.  Exiting.
06/04 09:49:26 MasterPluginManager::Shutdown
06/04 09:49:26 MgmtMasterPlugin: shutting down...
06/04 09:49:26 **** condor_master (condor_MASTER) pid 11909 EXITING WITH STATUS 0 

SchedLog:


06/04 09:49:18 (pid:11926) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.16.66.144:9618>.
06/04 09:49:18 (pid:11926) IO: Failed to read packet header
06/04 09:49:18 (pid:11926) Failed to read ClassAd size.
06/04 09:49:18 (pid:11926) SECMAN: no classad from server, failing
06/04 09:49:18 (pid:11926) ERROR: SECMAN:2004:Failed to create security session to <10.16.66.144:9618> with TCP.|SECMAN:2007:Failed to end classad message.
06/04 09:49:18 (pid:11926) Canceled/Closed 5 socket(s) at shutdown
06/04 09:49:18 (pid:11926) MgmtScheddPlugin: shutting down...
06/04 09:49:19 (pid:11926) All shadows have been killed, exiting.
Stack dump for process 11926 at timestamp 1275659359 (8 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d4a4]
condor_schedd[0x817f204]
[0x774420]
[0x774420]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x1035796]
/usr/lib/libqpidcommon.so.2[0x102bb31]
/lib/libpthread.so.0[0x6ac832]
/lib/libc.so.6(clone+0x5e)[0x601e0e]
06/04 09:49:19 (pid:11926) **** condor_schedd (condor_SCHEDD) pid 11926 EXITING WITH STATUS 0

The scheduler still crashes with a segfault on the latest condor release.

Comment 11 Tomas Rusnak 2010-06-07 09:54:37 UTC
Created attachment 421764 [details]
strace from schedd at the crash

Strace from schedd - tested version: condor-7.4.3-0.18.el5

Comment 12 Tomas Rusnak 2010-06-15 08:47:19 UTC
Created attachment 424077 [details]
strace from schedd at the crash

Retested on the latest version of packages:

python-qpid-0.7.946106-1.el5
qpid-java-client-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-2.el5
python-condorutils-1.4-0.6.el5
condor-wallaby-client-2.9-0.1.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
qpid-java-common-0.7.946106-3.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
condor-wallaby-tools-2.9-0.1.el5
condor-7.4.3-0.19.el5
python-qmf-0.7.946106-3.el5
qpid-cpp-client-0.7.946106-2.el5
qmf-devel-0.7.946106-2.el5
qpid-tests-0.7.946106-1.el5
condor-qmf-7.4.3-0.19.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-tools-0.7.946106-4.el5
qmf-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
condor-debuginfo-7.4.3-0.19.el5

06/15 04:41:40 (pid:23088) -------- Begin starting jobs --------
06/15 04:41:40 (pid:23088) -------- Done starting jobs --------
06/15 04:41:53 (pid:23088) Got SIGQUIT.  Performing fast shutdown.
06/15 04:41:53 (pid:23088) Now in shutdown_fast. Sending signals to shadows
06/15 04:41:53 (pid:23088) ScheddCronMgr: Shutting down
06/15 04:41:53 (pid:23088) CronMgr: Killing all jobs
06/15 04:41:53 (pid:23088) Trying to update collector <10.16.66.146:9618>
06/15 04:41:53 (pid:23088) Attempting to send update via UDP to collector hp-dl160g6-01.rhts.bos.redhat.com <10.16.66.146:9618>
06/15 04:41:53 (pid:23088) Canceled/Closed 5 socket(s) at shutdown
06/15 04:41:53 (pid:23088) MgmtScheddPlugin: shutting down...
06/15 04:41:53 (pid:23088) All shadows have been killed, exiting.
Stack dump for process 23088 at timestamp 1276591313 (8 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d7d4]
06/15 04:41:53 (pid:23088) **** condor_schedd (condor_SCHEDD) pid 23088 EXITING WITH STATUS 0
condor_schedd[0x817f534]
[0x240420]
[0x240420]
/usr/lib/libqpidcommon.so.2(_ZN4qpid3sys6Poller3runEv+0x66)[0x499796]
/usr/lib/libqpidcommon.so.2[0x48fb31]
/lib/libpthread.so.0[0x94b832]
/lib/libc.so.6(clone+0x5e)[0x8a0e0e]

See attached strace.log from schedd. It's still crashing with SIGSEGV.

Comment 13 Tomas Rusnak 2010-06-18 12:31:17 UTC
Retest done on current condor-7.4.3-0.20.el5. The results on all platforms are the same as before. Same crash with same error message.

Comment 14 Matthew Farrellee 2010-06-24 22:33:02 UTC
Built into 7.4.4-0.1. A fix also went into qpid; either one is sufficient.

Comment 15 Tomas Rusnak 2010-07-16 14:39:53 UTC
Verified on all combinations of RHEL4/RHEL5 and x86/x86_64 with the latest packages:

$CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL4

$CondorVersion: 7.4.4 Jun 30 2010 BuildID: RH-7.4.4-0.4.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

$CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

$CondorVersion: 7.4.3 Mar 29 2010 BuildID: RH-7.4.4-0.4.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

99 restarts performed without any issue.

>>> VERIFIED