Bug 698946

Summary: gridmanager crashes when an EC2/e job is submitted
Product: Red Hat Enterprise MRG
Reporter: Luigi Toscano <ltoscano>
Component: condor
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Luigi Toscano <ltoscano>
Severity: urgent
Docs Contact:
Priority: urgent
Version: Development
CC: iboverma, matt
Target Milestone: 2.0
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: condor-7.6.1-0.4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-06-27 14:11:33 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 679553
Attachments:
  Sample job used for testing (flags: none)

Description Luigi Toscano 2011-04-22 11:48:08 UTC
Description of problem:
When an EC2/e job is submitted, condor_gridmanager crashes with a segmentation fault.

SchedLog:
04/22/11 13:35:47 (pid:14819) Started condor_gmanager for owner condor pid=14918
04/22/11 13:35:55 (pid:14819) condor_gridmanager (PID 14918, owner condor) exited due to signal 11 (Segmentation fault).


GridmanagerLog.<user>:
04/22/11 13:35:50 [14918] ================================>  EC2Job::EC2Job 1 
04/22/11 13:35:50 [14918] Found job 2.0 --- inserting
04/22/11 13:35:50 [14918] (2.0) doEvaluateState called: gmState GM_HOLD, condorState 1
Stack dump for process 14918 at timestamp 1303472155 (11 frames)
condor_gridmanager(dprintf_dump_stack+0x4a)[0x81aedba]
condor_gridmanager[0x81d8a66]
[0xcfc420]
/lib/libc.so.6(cfree+0x34)[0x378ae4]
condor_gridmanager(_ZN6EC2JobD0Ev+0x59)[0x80b6f69]
condor_gridmanager(_Z15doContactScheddv+0xe3a)[0x80d8a2a]
condor_gridmanager(_ZN12TimerManager7TimeoutEv+0x390)[0x8158df0]
condor_gridmanager(_ZN10DaemonCore6DriverEv+0x282)[0x81460a2]
condor_gridmanager(main+0x116f)[0x815cb5f]
/lib/libc.so.6(__libc_start_main+0xdc)[0x324e9c]
condor_gridmanager[0x80b58c1]
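
The mangled frames in the stack dump above can be decoded to make the call chain easier to read. The following is a minimal sketch, assuming a GCC- or Clang-style toolchain that provides `<cxxabi.h>`; the helper name `demangle` is hypothetical and is not part of Condor.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <cxxabi.h>

// Hypothetical helper: demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if the name cannot be demangled.
std::string demangle(const char *mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer, __cxa_demangle returns malloc'ed memory
    // that the caller must release with free().
    char *out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return std::string(mangled);
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

Applied to the dump, `_ZN6EC2JobD0Ev` decodes to `EC2Job::~EC2Job()` and `_Z15doContactScheddv` to `doContactSchedd()`, so the faulting `free()` call (`cfree` in libc) is made from the EC2Job destructor, which in turn is invoked from `doContactSchedd()`.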


The crash was seen with the testing environment configured manually and also with it configured using wallaby.


Version-Release number of selected component (if applicable):
condor-7.6.1-0.1.el5
condor-classads-7.6.1-0.1.el5
condor-ec2-enhanced-hooks-1.1-3.el5
condor-qmf-7.6.1-0.1.el5
condor-wallaby-client-4.0-5.el5
python-condorec2e-1.1-3.el5
python-condorutils-1.5-2.el5

Found on RHEL 5.6 and RHEL 6.1 Beta, on both i386 and x86_64.

Comment 1 Luigi Toscano 2011-04-26 17:57:05 UTC
Created attachment 495003 [details]
Sample job used for testing

Comment 2 Robert Rati 2011-04-26 19:22:07 UTC
The crash was caused by uninitialized pointers being freed: the EC2Job
constructor could exit without initializing all of its pointers.

Fixed upstream on V7_6-branch.
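
For illustration only, the class of bug described in this comment and one common way to avoid it can be sketched as follows. This is not the actual Condor code or the upstream patch: `SketchJob`, its members, and `early_exit_is_safe` are hypothetical names. The idea is that every owned pointer is set to null before the constructor has any chance to return early, so the destructor only ever passes null or valid pointers to `free()`.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Copy a C string into malloc'ed memory; the caller releases it with free().
static char *copy_string(const char *src) {
    assert(src != nullptr);
    std::size_t len = std::strlen(src) + 1;
    char *dst = static_cast<char *>(std::malloc(len));
    if (dst != nullptr) std::memcpy(dst, src, len);
    return dst;
}

// Hypothetical stand-in for a job object; NOT the real EC2Job class.
class SketchJob {
public:
    // All owned pointers start as null in the initializer list, before the
    // constructor body can take any early exit.
    explicit SketchJob(const char *access_key)
        : m_access_key(nullptr), m_ami_id(nullptr), m_valid(false)
    {
        if (access_key == nullptr) {
            // Early exit (for example, a job that will be put on hold).
            // m_ami_id was never assigned, but it is null, not garbage.
            return;
        }
        m_access_key = copy_string(access_key);
        m_ami_id = copy_string("ami-placeholder");  // placeholder value
        m_valid = (m_access_key != nullptr && m_ami_id != nullptr);
    }

    ~SketchJob() {
        // free(nullptr) is a no-op, so this is safe after an early exit.
        std::free(m_access_key);
        std::free(m_ami_id);
    }

    SketchJob(const SketchJob &) = delete;
    SketchJob &operator=(const SketchJob &) = delete;

    bool valid() const { return m_valid; }
    bool owns_nothing() const { return !m_access_key && !m_ami_id; }

private:
    char *m_access_key;
    char *m_ami_id;
    bool m_valid;
};

// Returns true if a job whose constructor exited early can be deleted cleanly.
bool early_exit_is_safe() {
    SketchJob *job = new SketchJob(nullptr);
    bool ok = !job->valid() && job->owns_nothing();
    delete job;  // destructor frees only null pointers here
    return ok;
}
```

Without the null initializers, the early-return path would leave `m_ami_id` holding an indeterminate value, and the destructor would hand that value to `free()`, which is the kind of failure shown in the stack dump.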

Comment 3 Luigi Toscano 2011-05-17 17:11:20 UTC
With the fix applied, condor_gridmanager no longer crashes.

Verified on RHEL 5.6/6.1, i386/x86_64, with condor-7.6.1-0.4.