Description of problem:
I configured 4 machines as HAScheduler and HACentralManager. I submitted many simple jobs to the pool and checked their state every 30 s. To test the HAScheduler feature, I shut down one machine that had a running scheduler and watched whether another scheduler started. It did. Some time later I found this stack dump in the shadow log:

02/12/11 04:38:09 ******************************************************
02/12/11 04:38:09 ** condor_shadow (CONDOR_SHADOW) STARTING UP
02/12/11 04:38:09 ** /usr/sbin/condor_shadow
02/12/11 04:38:09 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
02/12/11 04:38:09 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
02/12/11 04:38:09 ** $CondorVersion: 7.4.5 Feb 4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
02/12/11 04:38:09 ** $CondorPlatform: I386-LINUX_RHEL5 $
02/12/11 04:38:09 ** PID = 27503
02/12/11 04:38:09 ** Log last touched 2/12 04:38:06
02/12/11 04:38:09 ******************************************************
02/12/11 04:38:09 Using config source: /etc/condor/condor_config
02/12/11 04:38:09 Using local config sources:
02/12/11 04:38:09    /etc/condor/config.d/00personal_condor.config
02/12/11 04:38:09    /etc/condor/config.d/99configd.config
02/12/11 04:38:09    /var/lib/condor/wallaby_node.config
02/12/11 04:38:09 DaemonCore: Command Socket at <167:42959>
02/12/11 04:38:09 Setting maximum accepts per cycle 4.
02/12/11 04:38:09 Initializing a VANILLA shadow for job 198.84
02/12/11 04:38:10 (198.84) (27503): Request to run on slot8@dhcp-37-167m <167:39779> was ACCEPTED
02/12/11 04:38:11 (198.71) (26385): Switching to new job 218.79
02/12/11 04:38:11 (?.?) (26385): Initializing a VANILLA shadow for job 218.79
02/12/11 04:38:12 (198.50) (26379): Switching to new job 218.80
02/12/11 04:38:12 (?.?) (26379): Initializing a VANILLA shadow for job 218.80
02/12/11 04:38:12 (198.72) (26386): Retrying job cleanup, calling terminateJob()
02/12/11 04:38:14 (218.79) (26385): Request to run on slot15@dhcp-37-170 <170:34445> was ACCEPTED
02/12/11 04:38:14 (218.79) (26385): Attempting to locate disconnected starter
02/12/11 04:38:14 (218.80) (26379): Request to run on slot14@dhcp-37-170 <170:34445> was ACCEPTED
Stack dump for process 26379 at timestamp 1297481894 (7 frames)
condor_shadow(dprintf_dump_stack+0x44)[0x80eac84]
condor_shadow[0x80ecb44]
[0xcc7420]
condor_shadow(_ZN10DaemonCore6DriverEv+0x244)[0x80d7ce4]
condor_shadow(main+0xd80)[0x80e65e0]
/lib/libc.so.6(__libc_start_main+0xdc)[0x52de9c]
condor_shadow[0x80b1241]

Version-Release number of selected component (if applicable):
condor-7.4.5-0.8

Steps to Reproduce:
1. Set up a condor pool with 4 HASched, HACentralManager machines
2. Submit many simple jobs
3. Turn off condor on the machine with the running scheduler
4. Wait for the stack dump

Actual results:
The shadow produces a stack dump.

Expected results:
The shadow never produces a stack dump.

Additional info:
small job:
universe = vanilla
executable = /root/pokus.sh
arguments = @@time@@
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Error=/tmp/mrg_$(Cluster).$(Process).err
Output=/tmp/mrg_$(Cluster).$(Process).out
Log=/tmp/mrg_$(Cluster).$(Process).log
iwd = /tmp
queue 100

where @@time@@ is substituted by a random number less than 10.
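For anyone reproducing this, the @@time@@ substitution in the submit file could be done with a small script along these lines. This is only a sketch: the function name `render_submit` is hypothetical, and it assumes the random number is a non-negative integer (the report only says "less than 10").

```python
import random

# Submit description copied from the report; @@time@@ is the placeholder.
TEMPLATE = """universe = vanilla
executable = /root/pokus.sh
arguments = @@time@@
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Error=/tmp/mrg_$(Cluster).$(Process).err
Output=/tmp/mrg_$(Cluster).$(Process).out
Log=/tmp/mrg_$(Cluster).$(Process).log
iwd = /tmp
queue 100
"""


def render_submit(template: str, rng: random.Random) -> str:
    """Replace @@time@@ with a random integer in 0..9 (hypothetical helper)."""
    # Assumption: an integer value; randrange(10) is always strictly below 10.
    return template.replace("@@time@@", str(rng.randrange(10)))


if __name__ == "__main__":
    # Print one rendered submit description to stdout.
    print(render_submit(TEMPLATE, random.Random()))
```

The rendered text can be written to a file and passed to condor_submit. All 100 queued jobs in one cluster share the same argument, so a new value is drawn once per submission.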
Created attachment 479065
Logs and configuration of one machine. The configuration is the same on all of them.
Please verify this is still a problem with condor 7.5.6-0.1
Will retest during validation cycle.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: N/A
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

No stack dump.

>>> VERIFIED
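A "no stack dump" check like this one could be automated by scanning the shadow log text for the dump header seen in the original report. The sketch below is an assumption about how one might do that, not how the verification was actually performed; `find_stack_dumps` is a hypothetical helper, and the pattern is derived only from the single header line quoted in the description.

```python
import re

# Pattern modelled on the header from the report:
# "Stack dump for process 26379 at timestamp 1297481894 (7 frames)"
STACK_DUMP_RE = re.compile(
    r"Stack dump for process (\d+) at timestamp (\d+) \((\d+) frames\)"
)


def find_stack_dumps(log_text: str) -> list[tuple[int, int, int]]:
    """Return (pid, timestamp, frame_count) for each dump header found."""
    return [
        (int(pid), int(ts), int(frames))
        for pid, ts, frames in STACK_DUMP_RE.findall(log_text)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_dumps.py <path to shadow log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        dumps = find_stack_dumps(fh.read())
    for pid, ts, frames in dumps:
        print(f"stack dump: pid={pid} timestamp={ts} frames={frames}")
    # Non-zero exit status when any dump is present, for use in test scripts.
    sys.exit(1 if dumps else 0)
```

The log path is taken from the command line because the report does not state where the shadow log lives on the test machines.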
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html