Bug 483587 - EC2 instance leak on failure with VM_START
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.1
Hardware: All Linux
Priority: medium  Severity: high
Target Milestone: 2.1
Assigned To: Timothy St. Clair
QA Contact: MRG Quality Engineering
Reported: 2009-02-02 10:23 EST by Matthew Farrellee
Modified: 2011-08-31 08:26 EDT (History)

Doc Type: Bug Fix
Last Closed: 2011-05-31 14:59:18 EDT
Description Matthew Farrellee 2009-02-02 10:23:50 EST
Description of problem:

It appears possible to leak an EC2 instance if a job has previously been run and held, and an error occurs between when VM_START signals EC2 to start the instance and when the gridmanager records the instance id.


Version-Release number of selected component (if applicable):

7.2.1-0.3 and before


How reproducible:

100%


Steps to Reproduce:
1. submit an ec2 job
2. wait for it to be scheduled to ec2
3. hold the ec2 job
4. wait for it to be de-scheduled from ec2, e.g. terminated per ec2-describe-instances
5. release the ec2 job
6. induce an error after VM_START receives a response from ec2
7. the error should have put the job on hold; release the job
8. watch the gridmanager conclude the job is terminated, because it is looking at the instance from step 4, not the one from step 6

The error can be induced by simply swapping out the amazon_gahp with a gahp that is programmed to fail when RunInstances returns. The important thing is that the instance id returned by EC2 never reaches the gridmanager, and thus never reaches the schedd.
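One way to build such a failing gahp is a small stand-in that answers the gridmanager's command lines but reports an error for every AMAZON_VM_START. This is only a sketch: the real amazon_gahp speaks a richer ASCII protocol (acks, RESULTS batching, etc.), and handle_command and the simplified reply format here are illustrative, not condor's actual code.

```python
import sys

def handle_command(line):
    """Respond to a single simplified GAHP-style command line."""
    parts = line.strip().split()
    if not parts:
        return None
    cmd = parts[0]
    req_id = parts[1] if len(parts) > 1 else "0"
    if cmd == "AMAZON_VM_START":
        # Pretend EC2's RunInstances succeeded server-side, but report a
        # failure back to the gridmanager. The instance id is dropped
        # here, which is exactly the leak described in this bug.
        return f"{req_id} 1 Client 'injected RunInstances failure'"
    # Answer everything else with an immediate, data-free success.
    return f"{req_id} 0"

if __name__ == "__main__":
    for line in sys.stdin:
        reply = handle_command(line)
        if reply is not None:
            print(reply, flush=True)
```

Pointing the gridmanager at this script instead of the real amazon_gahp reproduces step 6 deterministically.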

Expected results:

Upon release in step 7, the gridmanager should take into account all instances that map to the job to discover the proper state for the job. The job should be mapped to the instance created in step 6.


Example log of a failure (from step 6 on):

1/31 12:35:33 [23016]    GridJobId = "amazon SSH_127.0.0.1_robin.local#1043.0#1233413406"
...
1/31 12:35:35 [23016] GAHP[23017] <- 'AMAZON_VM_START 4 
1/31 12:35:36 [23016] GAHP[23017] -> '4' '1' 'Client' 'Validation constraint violation: tag name or namespace mismatch in element <RunInstancesResponse>'
1/31 12:35:36 [23016] (1043.0) doEvaluateState called: gmState GM_START_VM, condorState 1
...
1/31 12:35:36 [23016] (1043.0) gm state change: GM_START_VM -> GM_HOLD
1/31 12:35:36 [23016] (1043.0) gm state change: GM_HOLD -> GM_DELETE
...
1/31 12:35:38 [23016]    HoldReason = "Validation constraint violation: tag name or namespace mismatch in element <RunInstancesResponse>"
...
1/31 12:35:58 [23050] GAHP[23053] <- 'AMAZON_VM_STATUS_ALL 2 ...'
1/31 12:35:58 [23050] GAHP[23053] -> '2' '0' 'i-9c901ff5' 'terminated' 'ami-cd39dca4' 'i-48af2021' 'terminated' 'ami-cd39dca4' 'i-24af204d' 'pending' 'ami-cd39dca4'
...
1/31 12:35:58 [23050] GAHP[23053] <- 'AMAZON_VM_RUNNING_KEYPAIR 3 ...'
...
1/31 12:35:59 [23050] GAHP[23053] -> '3' '0' 'i-9c901ff5' 'SSH_127.0.0.1_robin.local#1043.0#1233413406' 'i-48af2021' 'SSH_127.0.0.1_robin.local#1043.0#1233413406' 'i-24af204d' 'SSH_127.0.0.1_robin.local#1043.0#1233413406'
...
1/31 12:36:03 [23050]    GridJobId = "amazon SSH_127.0.0.1_robin.local#1043.0#1233413406 i-9c901ff5"
...
1/31 12:36:18 [23050] GAHP[23053] <- 'AMAZON_VM_STATUS 4 ... i-9c901ff5'
...
1/31 12:36:18 [23050] GAHP[23053] -> '4' '0' 'i-9c901ff5' 'terminated' 'ami-cd39dca4'
...
1/31 12:36:18 [23050] (1043.0) gm state change: GM_PROBE_JOB -> GM_SUBMITTED
1/31 12:36:18 [23050] (1043.0) gm state change: GM_SUBMITTED -> GM_DONE_SAVE
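The AMAZON_VM_STATUS_ALL reply in the log above is a flat quoted list: a request id, an error code, then repeating (instance_id, status, ami_id) triples. A minimal parser sketch for that line shape (illustrative only; the real gridmanager is C++ and handles quoting more carefully):

```python
def parse_status_all(reply):
    """Split a status-all reply into (req_id, err_code, records).

    Each record is an (instance_id, status, ami_id) triple. Assumes
    the fields themselves contain no spaces or quotes, as in the log.
    """
    fields = reply.replace("'", "").split()
    req_id, err_code, rest = fields[0], int(fields[1]), fields[2:]
    assert len(rest) % 3 == 0, "malformed reply"
    records = [tuple(rest[i:i + 3]) for i in range(0, len(rest), 3)]
    return req_id, err_code, records
```

Applied to the 12:35:58 reply, it yields two terminated instances and one pending instance, all mapped to job 1043.0 by the keypair query that follows.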
