Description of problem:
I configured 4 machines as HA Scheduler and HA Central Manager. I submitted many simple jobs to them and checked their state every 30 seconds. To test the HA Scheduler feature, I shut down the machine with the running scheduler and watched whether another scheduler started. It did. After some time I found this stack dump in the scheduler log:

02/11/11 09:54:32 (pid:20309) Got SIGQUIT. Performing fast shutdown.
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26416] for job 257.23
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26424] for job 257.28
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26556] for job 257.18
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26307] for job 257.45
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26313] for job 257.16
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26442] for job 257.25
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26575] for job 257.24
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26322] for job 257.15
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26708] for job 257.41
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26458] for job 257.30
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26713] for job 257.40
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26470] for job 257.33
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26343] for job 257.21
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26599] for job 257.22
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26475] for job 257.42
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26743] for job 257.17
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26490] for job 257.36
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26620] for job 257.31
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26368] for job 257.19
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26498] for job 257.39
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26500] for job 257.37
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26253] for job 257.32
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26515] for job 257.38
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26643] for job 257.35
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26389] for job 257.20
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26265] for job 257.34
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26521] for job 257.43
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26397] for job 257.29
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26406] for job 257.27
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26534] for job 257.44
02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26412] for job 257.26
02/11/11 09:54:32 (pid:20309) All shadows have been killed, exiting.
02/11/11 09:54:32 (pid:20309) **** condor_schedd (condor_SCHEDD) pid 20309 EXITING WITH STATUS 0
Stack dump for process 20309 at timestamp 1297414472 (16 frames)
condor_schedd(dprintf_dump_stack+0x3f)[0x81734bf]
condor_schedd[0x817381a]
/lib/tls/libpthread.so.0[0x4edc98]
condor_schedd(_ZN20TransferQueueRequestD1Ev+0x1d)[0x81250dd]
condor_schedd(_ZN20TransferQueueManagerD1Ev+0x51)[0x8126011]
condor_schedd(_ZN9SchedulerD1Ev+0x7ef)[0x80e8def]
/lib/tls/libc.so.6(exit+0x77)[0x32c697]
condor_schedd(__wrap_exit+0x7b)[0x815033b]
condor_schedd(_Z7DC_ExitiPKc+0x135)[0x81687b5]
condor_schedd(_ZN9Scheduler13shutdown_fastEv+0xec)[0x80da5ec]
condor_schedd(_Z18main_shutdown_fastv+0x28)[0x80fa228]
condor_schedd(_Z17handle_dc_sigquitP7Servicei+0x51)[0x816a4c1]
condor_schedd(_ZN10DaemonCore6DriverEv+0xd82)[0x8155a52]
condor_schedd(main+0x133e)[0x816b82e]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0x316e93]
condor_schedd(__gxx_personality_v0+0x179)[0x80d20e1]

Version-Release number of selected component (if applicable):
condor-7.4.5-0.8.el5

Steps to Reproduce:
1. Set up a condor pool with 4 HA Scheduler / HA Central Manager machines.
2. Submit many simple jobs.
3. Turn off condor on the machine with the running scheduler.
4. Wait for the stack dump.

Actual results:
The scheduler sometimes produces a stack dump during its shutdown by SIGQUIT.

Expected results:
The scheduler never produces a stack dump during its shutdown by SIGQUIT.

Additional info:
Small job:

universe = vanilla
executable = /root/pokus.sh
arguments = @@time@@
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Error = /tmp/mrg_$(Cluster).$(Process).err
Output = /tmp/mrg_$(Cluster).$(Process).out
Log = /tmp/mrg_$(Cluster).$(Process).log
iwd = /tmp
queue 100

where @@time@@ is substituted by a random number less than 10.
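For reference, step 1 above corresponds roughly to the standard HTCondor high-availability setup (condor_had for the central manager, shared spool plus lock file for the schedd). The following is only a minimal sketch under that assumption; it is not the configuration attached to this bug, and all hostnames, paths, port numbers and timing values are placeholders:

# HA central manager: run condor_had on every candidate machine
CENTRAL_MANAGER1 = node1.example.com
CENTRAL_MANAGER2 = node2.example.com
CENTRAL_MANAGER3 = node3.example.com
CENTRAL_MANAGER4 = node4.example.com
COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2),$(CENTRAL_MANAGER3),$(CENTRAL_MANAGER4)
HAD_PORT = 51450
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT),$(CENTRAL_MANAGER2):$(HAD_PORT),$(CENTRAL_MANAGER3):$(HAD_PORT),$(CENTRAL_MANAGER4):$(HAD_PORT)
HAD_USE_PRIMARY = false
# let condor_had decide which machine runs the negotiator
MASTER_NEGOTIATOR_CONTROLLER = HAD
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, SCHEDD, STARTD

# HA schedd: all machines point at one shared spool and take turns holding the lock
MASTER_HA_LIST = SCHEDD
# must be on a filesystem visible to all four machines
SPOOL = /shared/spool
HA_LOCK_URL = file:$(SPOOL)
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
HA_LOCK_HOLD_TIME = 300
HA_POLL_PERIOD = 60
# trailing '@' keeps the schedd name independent of the host it runs on
SCHEDD_NAME = had-schedd@

With a setup along these lines, stopping Condor on the node that currently holds the schedd lock (step 3) makes the master fast-shutdown the schedd; that is the SIGQUIT / "Performing fast shutdown" path visible in the log above, after which another node's master starts a schedd under the same name.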
Created attachment 479066: logs and configuration of one machine. The configuration is the same on all of them.
Please verify this is still a problem with condor 7.5.6-0.1
Will retest during validation cycle.
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $ $CondorPlatform: I686-RedHat_5.6 $
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $ $CondorPlatform: X86_64-RedHat_5.6 $
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $ $CondorPlatform: I686-RedHat_6.0 $
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $ $CondorPlatform: X86_64-RedHat_6.0 $

No stack dump.

>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0889.html