Bug 677919 - Scheduler creates stack dump during shutdown
Summary: Scheduler creates stack dump during shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 2.0
Assignee: Matthew Farrellee
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 693778
 
Reported: 2011-02-16 09:41 UTC by Martin Kudlej
Modified: 2011-06-23 15:39 UTC
3 users

Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
N/A
Clone Of:
Environment:
Last Closed: 2011-06-23 15:39:24 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs and configuration of one machine; the configuration is the same on all of them (2.44 MB, application/x-gzip)
2011-02-16 09:58 UTC, Martin Kudlej


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0889 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Grid 2.0 Release 2011-06-23 15:35:53 UTC

Description Martin Kudlej 2011-02-16 09:41:40 UTC
Description of problem:
I set up 4 machines as HAScheduler and HACentralManager. I submitted many simple jobs there and checked their state every 30 s. I shut down one machine with a running scheduler in order to test the HAScheduler feature, and watched whether another scheduler would start. It did. After some time I found this stack dump in the scheduler log:
    02/11/11 09:54:32 (pid:20309) Got SIGQUIT.  Performing fast shutdown.
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26416] for job 257.23
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26424] for job 257.28
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26556] for job 257.18
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26307] for job 257.45
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26313] for job 257.16
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26442] for job 257.25
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26575] for job 257.24
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26322] for job 257.15
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26708] for job 257.41
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26458] for job 257.30
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26713] for job 257.40
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26470] for job 257.33
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26343] for job 257.21
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26599] for job 257.22
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26475] for job 257.42
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26743] for job 257.17
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26490] for job 257.36
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26620] for job 257.31
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26368] for job 257.19
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26498] for job 257.39
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26500] for job 257.37
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26253] for job 257.32
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26515] for job 257.38
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26643] for job 257.35
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26389] for job 257.20
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26265] for job 257.34
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26521] for job 257.43
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26397] for job 257.29
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26406] for job 257.27
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26534] for job 257.44
    02/11/11 09:54:32 (pid:20309) Sent signal 9 to shadow [pid 26412] for job 257.26
    02/11/11 09:54:32 (pid:20309) All shadows have been killed, exiting.
    02/11/11 09:54:32 (pid:20309) **** condor_schedd (condor_SCHEDD) pid 20309 EXITING WITH STATUS 0
    Stack dump for process 20309 at timestamp 1297414472 (16 frames)
    condor_schedd(dprintf_dump_stack+0x3f)[0x81734bf]
    condor_schedd[0x817381a]
    /lib/tls/libpthread.so.0[0x4edc98]
    condor_schedd(_ZN20TransferQueueRequestD1Ev+0x1d)[0x81250dd]
    condor_schedd(_ZN20TransferQueueManagerD1Ev+0x51)[0x8126011]
    condor_schedd(_ZN9SchedulerD1Ev+0x7ef)[0x80e8def]
    /lib/tls/libc.so.6(exit+0x77)[0x32c697]
    condor_schedd(__wrap_exit+0x7b)[0x815033b]
    condor_schedd(_Z7DC_ExitiPKc+0x135)[0x81687b5]
    condor_schedd(_ZN9Scheduler13shutdown_fastEv+0xec)[0x80da5ec]
    condor_schedd(_Z18main_shutdown_fastv+0x28)[0x80fa228]
    condor_schedd(_Z17handle_dc_sigquitP7Servicei+0x51)[0x816a4c1]
    condor_schedd(_ZN10DaemonCore6DriverEv+0xd82)[0x8155a52]
    condor_schedd(main+0x133e)[0x816b82e]
    /lib/tls/libc.so.6(__libc_start_main+0xd3)[0x316e93]
    condor_schedd(__gxx_personality_v0+0x179)[0x80d20e1] 

Version-Release number of selected component (if applicable):
condor-7.4.5-0.8.el5

Steps to Reproduce:
1. Set up a condor pool with 4 HAScheduler/HACentralManager machines
2. Submit many simple jobs
3. Turn off condor on the machine with the running scheduler
4. Wait for the stack dump
  
Actual results:
The scheduler sometimes produces a stack dump when shut down via SIGQUIT.

Expected results:
The scheduler never produces a stack dump when shut down via SIGQUIT.

Additional info:
small job:
universe = vanilla
executable = /root/pokus.sh
arguments = @@time@@
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Error=/tmp/mrg_$(Cluster).$(Process).err
Output=/tmp/mrg_$(Cluster).$(Process).out
Log=/tmp/mrg_$(Cluster).$(Process).log
iwd = /tmp

queue 100

where @@time@@ is substituted with a random number less than 10.

Comment 1 Martin Kudlej 2011-02-16 09:58:02 UTC
Created attachment 479066 [details]
Logs and configuration of one machine; the configuration is the same on all of them

Comment 2 Matthew Farrellee 2011-02-28 19:43:47 UTC
Please verify this is still a problem with condor 7.5.6-0.1

Comment 3 Martin Kudlej 2011-03-04 14:43:01 UTC
Will retest during validation cycle.

Comment 4 Matthew Farrellee 2011-04-27 20:22:26 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
N/A

Comment 6 Lubos Trilety 2011-05-16 13:14:37 UTC
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

No stack dump.

>>> VERIFIED

Comment 7 errata-xmlrpc 2011-06-23 15:39:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

