Bug 677925 - Shadow raises stack dump
Summary: Shadow raises stack dump
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 2.0
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 693778
 
Reported: 2011-02-16 09:56 UTC by Martin Kudlej
Modified: 2011-06-23 15:39 UTC
CC: 2 users

Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
N/A
Clone Of:
Environment:
Last Closed: 2011-06-23 15:39:28 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs and configuration of one machine; the configuration is the same on all of them (2.44 MB, application/x-gzip)
2011-02-16 09:57 UTC, Martin Kudlej


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0889 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Grid 2.0 Release 2011-06-23 15:35:53 UTC

Description Martin Kudlej 2011-02-16 09:56:32 UTC
Description of problem:
I set up 4 machines as both HAScheduler and HACentralManager. I submitted many simple jobs and checked their state every 30s. To test the HAScheduler feature, I shut down the machine running the active scheduler and watched whether another scheduler took over. It did. Some time later I found this stack dump in the shadow log:
    02/12/11 04:38:09 ******************************************************
    02/12/11 04:38:09 ** condor_shadow (CONDOR_SHADOW) STARTING UP
    02/12/11 04:38:09 ** /usr/sbin/condor_shadow
    02/12/11 04:38:09 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
    02/12/11 04:38:09 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
    02/12/11 04:38:09 ** $CondorVersion: 7.4.5 Feb  4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
    02/12/11 04:38:09 ** $CondorPlatform: I386-LINUX_RHEL5 $
    02/12/11 04:38:09 ** PID = 27503
    02/12/11 04:38:09 ** Log last touched 2/12 04:38:06
    02/12/11 04:38:09 ******************************************************
    02/12/11 04:38:09 Using config source: /etc/condor/condor_config
    02/12/11 04:38:09 Using local config sources:
    02/12/11 04:38:09    /etc/condor/config.d/00personal_condor.config
    02/12/11 04:38:09    /etc/condor/config.d/99configd.config
    02/12/11 04:38:09    /var/lib/condor/wallaby_node.config
    02/12/11 04:38:09 DaemonCore: Command Socket at <167:42959>
    02/12/11 04:38:09 Setting maximum accepts per cycle 4.
    02/12/11 04:38:09 Initializing a VANILLA shadow for job 198.84
    02/12/11 04:38:10 (198.84) (27503): Request to run on slot8@dhcp-37-167m <167:39779> was ACCEPTED
    02/12/11 04:38:11 (198.71) (26385): Switching to new job 218.79
    02/12/11 04:38:11 (?.?) (26385): Initializing a VANILLA shadow for job 218.79
    02/12/11 04:38:12 (198.50) (26379): Switching to new job 218.80
    02/12/11 04:38:12 (?.?) (26379): Initializing a VANILLA shadow for job 218.80
    02/12/11 04:38:12 (198.72) (26386): Retrying job cleanup, calling terminateJob()
    02/12/11 04:38:14 (218.79) (26385): Request to run on slot15@dhcp-37-170 <170:34445> was ACCEPTED
    02/12/11 04:38:14 (218.79) (26385): Attempting to locate disconnected starter
    02/12/11 04:38:14 (218.80) (26379): Request to run on slot14@dhcp-37-170 <170:34445> was ACCEPTED
    Stack dump for process 26379 at timestamp 1297481894 (7 frames)
    condor_shadow(dprintf_dump_stack+0x44)[0x80eac84]
    condor_shadow[0x80ecb44]
    [0xcc7420]
    condor_shadow(_ZN10DaemonCore6DriverEv+0x244)[0x80d7ce4]
    condor_shadow(main+0xd80)[0x80e65e0]
    /lib/libc.so.6(__libc_start_main+0xdc)[0x52de9c]
    condor_shadow[0x80b1241] 
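
Only one frame in the dump is symbolized; its mangled name can be demangled with c++filt, and, assuming the matching condor-debuginfo package is installed, the bare addresses can be resolved with addr2line:

    $ echo _ZN10DaemonCore6DriverEv | c++filt
    DaemonCore::Driver()
    $ addr2line -C -f -e /usr/sbin/condor_shadow 0x80ecb44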

Version-Release number of selected component (if applicable):
condor-7.4.5-0.8

Steps to Reproduce:
1. Set up a condor pool with 4 machines acting as HAScheduler and HACentralManager (see the configuration sketch after these steps)
2. Submit many simple jobs
3. Shut down condor on the machine running the active scheduler
4. Wait for a stack dump to appear in the shadow log
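
For context, schedd failover is coordinated by the condor_master processes through MASTER_HA_LIST and a lock file on shared storage, while central-manager failover uses the HAD and REPLICATION daemons. A minimal sketch of the knobs involved follows; the hostnames, port, and shared spool path are placeholders, and the pool in this report was actually configured through wallaby (see the attached configs):

    # HA central manager: run HAD and REPLICATION on each candidate CM
    CENTRAL_MANAGER1 = cm1.example.com
    CENTRAL_MANAGER2 = cm2.example.com
    COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)
    HAD_PORT = 51450
    HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT),$(CENTRAL_MANAGER2):$(HAD_PORT)
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION, SCHEDD, STARTD

    # HA schedd: condor_master starts the schedd only while holding the lock
    MASTER_HA_LIST = SCHEDD
    SPOOL = /share/spool
    HA_LOCK_URL = file:/share/spool
    VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
    SCHEDD_NAME = had-schedd@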
  
Actual results:
The condor_shadow daemon writes a stack dump to the shadow log.

Expected results:
The condor_shadow daemon never writes a stack dump.

Additional info:
Submit file for a small job:
universe = vanilla
executable = /root/pokus.sh
arguments = @@time@@
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Error=/tmp/mrg_$(Cluster).$(Process).err
Output=/tmp/mrg_$(Cluster).$(Process).out
Log=/tmp/mrg_$(Cluster).$(Process).log
iwd = /tmp

queue 100

where @@time@@ is substituted with a random number less than 10.
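
The contents of /root/pokus.sh are not included in this report; a plausible stand-in consistent with the submit file (a script that simply sleeps for its argument) would be:

    #!/bin/sh
    # Hypothetical reconstruction of /root/pokus.sh: sleep for the
    # number of seconds passed as the first argument (@@time@@).
    sleep "$1"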

Comment 1 Martin Kudlej 2011-02-16 09:57:42 UTC
Created attachment 479065 [details]
Logs and configuration of one machine; the configuration is the same on all of them.

Comment 2 Matthew Farrellee 2011-02-28 19:43:45 UTC
Please verify this is still a problem with condor 7.5.6-0.1.

Comment 4 Martin Kudlej 2011-03-04 14:43:01 UTC
Will retest during validation cycle.

Comment 5 Matthew Farrellee 2011-04-27 20:22:40 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
N/A

Comment 7 Lubos Trilety 2011-05-16 13:14:33 UTC
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

No stack dump.

>>> VERIFIED

Comment 8 errata-xmlrpc 2011-06-23 15:39:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

