Bug 475586

Summary: amazon_gahp fails when user data file is empty
Product: Red Hat Enterprise MRG
Component: grid
Version: 1.0
Target Milestone: 1.1
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: Robert Rati <rrati>
Assignee: Matthew Farrellee <matt>
QA Contact: Jeff Needle <jneedle>
CC: matt
Doc Type: Bug Fix
Last Closed: 2009-02-04 16:06:24 UTC

Description Robert Rati 2008-12-09 18:05:52 UTC
Description of problem:
Submitted EC2E job and the routed job caused the gridmanager to dump.

Version-Release number of selected component (if applicable):
7.2.0-0.8

How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
12/9 12:59:38 [20209] ================================>  AmazonJob::AmazonJob 1
12/9 12:59:38 [20209] Found job 2999.0 --- inserting
12/9 12:59:38 [20209] Using job type Amazon for job 3033.0
12/9 12:59:38 [20209] (3033.0) SetJobLeaseTimers()
12/9 12:59:38 [20209] ================================>  AmazonJob::AmazonJob 1
12/9 12:59:38 [20209] Found job 3033.0 --- inserting
12/9 12:59:38 [20209] Fetched 2 new job ads from schedd
12/9 12:59:38 [20209] querying for removed/held jobs
12/9 12:59:38 [20209] Using constraint ((Owner=?="testmonkey"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
12/9 12:59:38 [20209] Fetched 1 job ads from schedd
12/9 12:59:38 [20209] leaving doContactSchedd()
12/9 12:59:38 [20209] Create_Process: using fast clone() to create child process.
12/9 12:59:38 [20209] GAHP server pid = 20223
12/9 12:59:38 [20209] GAHP server version: $GahpVersion 0.0.2 Feb 15 2008 Condor AMAZONGAHP $
12/9 12:59:38 [20209] GAHP[20223] <- 'COMMANDS'
12/9 12:59:38 [20209] GAHP[20223] -> 'S' 'ASYNC_MODE_ON' 'ASYNC_MODE_OFF' 'RESULTS' 'QUIT' 'VERSION' 'COMMANDS' 'AMAZON_VM_START' 'AMAZON_VM_STOP' 'AMAZON_VM_STATUS' 'AMAZON_VM_STATUS_ALL' 'AMAZON_VM_RUNNING_KEYPAIR' 'AMAZON_VM_CREATE_KEYPAIR' 'AMAZON_VM_DESTROY_KEYPAIR' 'AMAZON_VM_KEYPAIR_NAMES'
12/9 12:59:38 [20209] GAHP[20223] <- 'ASYNC_MODE_ON'
12/9 12:59:38 [20209] GAHP[20223] -> 'S'
12/9 12:59:38 [20209] GAHP[20223] <- 'AMAZON_VM_STATUS_ALL 2 /home/testmonkey/ad /home/testmonkey/ad'
12/9 12:59:38 [20209] GAHP[20223] -> 'S'
12/9 12:59:38 [20209] *** UpdateLeases called
12/9 12:59:38 [20209]     Leases not supported, cancelling timer
12/9 12:59:38 [20209] (2999.0) doEvaluateState called: gmState GM_INIT, condorState 3
12/9 12:59:38 [20209] (2999.0) gm state change: GM_INIT -> GM_START
12/9 12:59:38 [20209] (2999.0) gm state change: GM_START -> GM_CHECK_VM
12/9 12:59:38 [20209] GAHP[20223] <- 'AMAZON_VM_STATUS_ALL 3 /tmp/cert-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem /tmp/pk-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem'
12/9 12:59:38 [20209] GAHP[20223] -> 'R'
12/9 12:59:38 [20209] GAHP[20223] -> 'S'
12/9 12:59:38 [20209] *** UpdateLeases called
12/9 12:59:38 [20209]     Leases not supported, cancelling timer
12/9 12:59:38 [20209] (3033.0) doEvaluateState called: gmState GM_INIT, condorState 1
12/9 12:59:38 [20209] (3033.0) gm state change: GM_INIT -> GM_START
12/9 12:59:38 [20209] (3033.0) gm state change: GM_START -> GM_CHECK_VM
12/9 12:59:38 [20209] GAHP[20223] <- 'RESULTS'
12/9 12:59:38 [20209] GAHP[20223] -> 'S' '1'
12/9 12:59:38 [20209] GAHP[20223] -> '2' '1' 'GAHPERROR' 'Could not read private RSA key from: /home/testmonkey/ad'
12/9 12:59:38 [20209] resource amazon is now down
12/9 12:59:38 [20209] (2999.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 3
12/9 12:59:38 [20209] GAHP[20223] <- 'RESULTS'
12/9 12:59:38 [20209] GAHP[20223] -> 'R'
12/9 12:59:38 [20209] GAHP[20223] -> 'S' '1'
12/9 12:59:38 [20209] GAHP[20223] -> '3' '0' 'i-219e2148' 'terminated' 'ami-927a9efb' 'i-a59e21cc' 'terminated' 'ami-927a9efb'
12/9 12:59:38 [20209] resource amazon is now up
12/9 12:59:38 [20209] (3033.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
12/9 12:59:38 [20209] GAHP[20223] <- 'AMAZON_VM_RUNNING_KEYPAIR 4 /tmp/cert-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem /tmp/pk-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem'
12/9 12:59:38 [20209] GAHP[20223] -> 'S'
12/9 12:59:38 [20209] GAHP[20223] <- 'RESULTS'
12/9 12:59:38 [20209] GAHP[20223] -> 'R'
12/9 12:59:38 [20209] GAHP[20223] -> 'S' '1'
12/9 12:59:38 [20209] GAHP[20223] -> '4' '0' 'i-219e2148' 'aws-keypair' 'i-a59e21cc' 'aws-keypair'
12/9 12:59:38 [20209] (3033.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
12/9 12:59:38 [20209] (3033.0) gm state change: GM_CHECK_VM -> GM_DESTROY_KEYPAIR_SUBMIT
12/9 12:59:38 [20209] GAHP[20223] <- 'AMAZON_VM_DESTROY_KEYPAIR 5 /tmp/cert-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem /tmp/pk-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem SSH_north-08.lab.bos.redhat.com,north-14.lab.bos.redhat.com,north-15.lab.bos.redhat.com_ha-schedd@#3033.0#1228845269'
12/9 12:59:38 [20209] GAHP[20223] -> 'S' 
12/9 12:59:38 [20209] GAHP[20223] <- 'RESULTS' 
12/9 12:59:38 [20209] GAHP[20223] -> 'R'
12/9 12:59:38 [20209] GAHP[20223] -> 'S' '1'
12/9 12:59:38 [20209] GAHP[20223] -> '5' '0'
12/9 12:59:38 [20209] (3033.0) doEvaluateState called: gmState GM_DESTROY_KEYPAIR_SUBMIT, condorState 1
12/9 12:59:38 [20209] (3033.0) gm state change: GM_DESTROY_KEYPAIR_SUBMIT -> GM_CREATE_KEYPAIR
12/9 12:59:38 [20209] GAHP[20223] <- 'AMAZON_VM_CREATE_KEYPAIR 6 /tmp/cert-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem /tmp/pk-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem SSH_north-08.lab.bos.redhat.com,north-14.lab.bos.redhat.com,north-15.lab.bos.redhat.com_ha-schedd@#3033.0#1228845269 /tmp/keypair-0'
12/9 12:59:38 [20209] GAHP[20223] -> 'S'
12/9 12:59:41 [20209] GAHP[20223] <- 'RESULTS'
12/9 12:59:41 [20209] GAHP[20223] -> 'R'
12/9 12:59:41 [20209] GAHP[20223] -> 'S' '1'
12/9 12:59:41 [20209] GAHP[20223] -> '6' '0'
12/9 12:59:41 [20209] (3033.0) doEvaluateState called: gmState GM_CREATE_KEYPAIR, condorState 1
12/9 12:59:41 [20209] (3033.0) gm state change: GM_CREATE_KEYPAIR -> GM_START_VM
12/9 12:59:41 [20209] GAHP[20223] <- 'AMAZON_VM_START 7 /tmp/cert-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem /tmp/pk-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem ami-6240a40b SSH_north-08.lab.bos.redhat.com,north-14.lab.bos.redhat.com,north-15.lab.bos.redhat.com_ha-schedd@#3033.0#1228845269 NULL /tmp/aws-keys-19528 m1.small'
12/9 12:59:41 [20209] GAHP[20223] -> 'S'
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> Stack dump for process 20223 at timestamp 1228845581 (12 frames)
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp(dprintf_dump_stack+0xc0)[0x588d33]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp[0x589006]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /lib64/libpthread.so.0[0x308ca0de70]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp[0x4e8f8f]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp(_ZN13AmazonVMStart12gsoapRequestEv+0x3a8)[0x4eac66]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp(_ZN13AmazonRequest11SendRequestEv+0x57)[0x4e58a3]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp(_ZN13AmazonVMStart14workerFunctionEPPciR8MyString+0x39e)[0x4e74ba]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp(_Z17executeWorkerFuncPKcPPciR8MyString+0x8e)[0x4e209c]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp[0x4de743]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /usr/sbin/amazon_gahp[0x4df0d0]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /lib64/libpthread.so.0[0x308ca062f7]
12/9 12:59:41 [20209] GAHP[20223] (stderr) -> /lib64/libc.so.6(clone+0x6d)[0x33cf8d1b6d]
12/9 12:59:41 [20209] GAHP[20223] <- 'RESULTS'
12/9 12:59:41 [20209] GAHP[20223] -> EOF
12/9 12:59:41 [20209] GAHP command 'RESULTS' failed
12/9 12:59:41 [20209] DaemonCore: No more children processes to reap.
12/9 12:59:41 [20209] GAHP[20223] <- 'RESULTS'
12/9 12:59:41 [20209] GAHP[20223] -> EOF
12/9 12:59:41 [20209] GAHP command 'RESULTS' failed
12/9 12:59:41 [20209] ERROR "Gahp Server (pid=20223) died due to signal 11 (Segmentation fault) unexpectedly" at line 303 in file gahp-client.cpp
Stack dump for process 20209 at timestamp 1228845581 (13 frames)
condor_gridmanager(dprintf_dump_stack+0xc0)[0x53b8a7]
condor_gridmanager[0x53bb7a]
/lib64/libc.so.6[0x33cf8301b0]
/lib64/libc.so.6(abort+0x28f)[0x33cf831d6f]
condor_gridmanager(_EXCEPT_+0x1a5)[0x53a093]
condor_gridmanager(_ZN10GahpServer6ReaperEP7Serviceii+0x122)[0x4e5396]
condor_gridmanager(_ZN10DaemonCore10CallReaperEiPKcii+0x12b)[0x520805]
condor_gridmanager(_ZN10DaemonCore17HandleProcessExitEii+0x1c0)[0x524338]
condor_gridmanager(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x42)[0x5244b4]
condor_gridmanager(_ZN10DaemonCore6DriverEv+0x715)[0x52b3fb]
condor_gridmanager(main+0x1898)[0x5354e8]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x33cf81d8b4]
condor_gridmanager(__gxx_personality_v0+0x481)[0x48c4a9]


Expected results:


Additional info:

Comment 1 Matthew Farrellee 2008-12-09 20:50:03 UTC
gahp-client.cpp:303 will EXCEPT the gridmanager if the gahp it is communicating with crashes, which is what is happening here. The interesting part of the log is why the gahp is crashing: apparently a seg fault, with the stack...

12/9 12:54:44 [16937] (3033.0) gm state change: GM_CREATE_KEYPAIR -> GM_START_VM
12/9 12:54:44 [16937] GAHP[16951] <- 'AMAZON_VM_START 14 /tmp/cert-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem /tmp/pk-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem ami-6240a40b SSH_north-08.lab.bos.redhat.com,north-14.lab.bos.redhat.com,north-15.lab.bos.redhat.com_ha-schedd@#3033.0#1228845269 NULL /tmp/aws-keys-19528 m1.small'
12/9 12:54:44 [16937] GAHP[16951] -> 'S'
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> Stack dump for process 16951 at timestamp 1228845284 (12 frames)
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp(dprintf_dump_stack+0xc0)[0x588d33]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp[0x589006]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /lib64/libpthread.so.0[0x308ca0de70]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp[0x4e8f8f]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp(AmazonVMStart::gsoapRequest()+0x3a8)[0x4eac66]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp(AmazonRequest::SendRequest()+0x57)[0x4e58a3]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp(AmazonVMStart::workerFunction(char**, int, MyString&)+0x39e)[0x4e74ba]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp(executeWorkerFunc(char const*, char**, int, MyString&)+0x8e)[0x4e209c]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp[0x4de743]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /usr/sbin/amazon_gahp[0x4df0d0]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /lib64/libpthread.so.0[0x308ca062f7]
12/9 12:54:44 [16937] GAHP[16951] (stderr) -> /lib64/libc.so.6(clone+0x6d)[0x33cf8d1b6d]
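
For reference, the failure mode described in this comment (a parent process noticing that its helper died from a signal and treating that as fatal) can be sketched in a few lines of POSIX C++. This is an illustration only, not the code in gahp-client.cpp: `ChildExit`, `classify_status`, `run_and_reap` and the two child bodies are hypothetical names, and the stack dump above shows the real gridmanager going through DaemonCore reapers rather than calling waitpid() directly.

```cpp
#include <cassert>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical summary of how a child process ended.
struct ChildExit {
    bool by_signal;  // true if the child was killed by a signal
    int  code;       // signal number if by_signal, else exit status (-1 if unknown)
};

// Decode a waitpid() status word.
ChildExit classify_status(int status) {
    if (WIFSIGNALED(status)) return {true, WTERMSIG(status)};
    if (WIFEXITED(status))   return {false, WEXITSTATUS(status)};
    return {false, -1};
}

// Fork a child that runs child_body(), then reap it in the parent.
ChildExit run_and_reap(void (*child_body)()) {
    pid_t pid = fork();
    if (pid < 0) return {false, -1};  // fork failed
    if (pid == 0) {
        child_body();
        _exit(0);  // reached only if child_body() returns
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return {false, -1};
    return classify_status(status);
}

// Stand-in for a helper that dies with a segmentation fault.
void crash_like_gahp() {
    signal(SIGSEGV, SIG_DFL);  // make sure the default (fatal) action applies
    raise(SIGSEGV);
}

// Stand-in for a helper that exits normally.
void exit_cleanly() {}

// Usage: a parent can tell the two outcomes apart and react accordingly.
void demo() {
    assert(run_and_reap(crash_like_gahp).by_signal);
    assert(!run_and_reap(exit_cleanly).by_signal);
}
```

A parent that sees `by_signal` set can decide whether to abort, as the gridmanager does here, or to restart the helper; which is appropriate is a design choice this sketch does not try to settle.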

Comment 2 Matthew Farrellee 2008-12-09 23:14:18 UTC
The amazon_gahp will index an array at -1 when the user data file is empty. This is fixed for 7.2.0-0.10. To reproduce you need a valid cert and pk.

# amazon_gahp 
$GahpVersion 0.0.2 Feb 15 2008 Condor\ AMAZONGAHP $
AMAZON_VM_START 0 /tmp/cert-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem /tmp/pk-B4R2HOPR74CAW5EGDKWTQEUND6EX6Y2G.pem ami-ID KEY NULL /dev/null TYPE
S
Stack dump for process 6116 at timestamp 1228864098 (12 frames)
amazon_gahp(dprintf_dump_stack+0xc0)[0x588d33]
amazon_gahp[0x589006]
/lib64/libpthread.so.0[0x308ca0de70]
amazon_gahp[0x4e8f8f]
amazon_gahp(_ZN13AmazonVMStart12gsoapRequestEv+0x3a8)[0x4eac66]
amazon_gahp(_ZN13AmazonRequest11SendRequestEv+0x57)[0x4e58a3]
amazon_gahp(_ZN13AmazonVMStart14workerFunctionEPPciR8MyString+0x39e)[0x4e74ba]
amazon_gahp(_Z17executeWorkerFuncPKcPPciR8MyString+0x8e)[0x4e209c]
amazon_gahp[0x4de743]
amazon_gahp[0x4df0d0]
/lib64/libpthread.so.0[0x308ca062f7]
/lib64/libc.so.6(clone+0x6d)[0x33cf8d1b6d]
Segmentation fault (core dumped)
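
The exact source change is not shown in this report, but the class of bug is easy to illustrate. A minimal sketch, assuming the user data was read into a buffer and its last byte was then inspected (for example to strip a trailing newline): with an empty file the length is 0, so `len - 1` is -1. The function names below are hypothetical and are not taken from amazon_gahp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Unsafe pattern (shown as a comment only, never executed here): when the
// buffer is empty, len - 1 is -1 and the read is out of bounds.
//
//     int len = static_cast<int>(data.size());
//     if (data[len - 1] == '\n') { ... }

// Guarded version: check for an empty buffer before touching the last byte.
std::string strip_trailing_newline(const std::string& data) {
    if (data.empty()) {
        return data;  // nothing to inspect; this sketch treats empty input as valid
    }
    std::size_t len = data.size();
    if (data[len - 1] == '\n') {
        return data.substr(0, len - 1);
    }
    return data;
}

// Usage: the empty case (e.g. /dev/null as the user data file) is handled
// the same way as any other input.
void check_examples() {
    assert(strip_trailing_newline("") == "");
    assert(strip_trailing_newline("abc\n") == "abc");
}
```

The point of the guard is only that the length is tested before any `len - 1` arithmetic; whether empty user data should be passed on or rejected is a separate decision.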

Comment 5 errata-xmlrpc 2009-02-04 16:06:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html