Bug 483587

Summary:	EC2 instance leak on failure with VM_START
Product:	Red Hat Enterprise MRG	Reporter:	Matthew Farrellee <matt>
Component:	condor	Assignee:	Timothy St. Clair <tstclair>
Status:	CLOSED WORKSFORME	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	high	Docs Contact:
Priority:	medium
Version:	1.1	CC:	matt, tstclair
Target Milestone:	2.1
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-05-31 18:59:18 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Matthew Farrellee 2009-02-02 15:23:50 UTC

Description of problem:

It appears possible to leak an EC2 instance if a job has previously run and been held, and an error occurs between the time VM_START signals EC2 to start the instance and the time the gridmanager records the instance.


Version-Release number of selected component (if applicable):

7.2.1-0.3 and before


How reproducible:

100%


Steps to Reproduce:
1. submit ec2 job
2. wait for it to be scheduled to ec2
3. hold the ec2 job
4. wait for it to be de-scheduled from ec2, e.g. reported as terminated by ec2-describe-instances
5. release ec2 job
6. induce error after VM_START receives a response from ec2
7. the error should have put the job on hold; release the job
8. watch the gridmanager conclude the job is terminated, because it looks at the instance from step 4, not the one from step 6

The error can be induced by simply swapping out the amazon_gahp with a gahp that is programmed to fail when RunInstances returns. The important thing is that the instance id returned from EC2 does not make it to the gridmanager and thus to the schedd.
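
To make the failure window concrete, here is a toy model of it. This is only a sketch, not the amazon_gahp or its protocol: FakeEC2, failing_vm_start and ResponseParseError are hypothetical names made up for illustration. The point it shows is that the instance exists on the EC2 side while the caller never learns its id.

```python
class FakeEC2:
    """Hypothetical stand-in for EC2: remembers every instance it starts."""

    def __init__(self):
        self.instances = {}
        self._next = 0

    def run_instances(self, keypair):
        self._next += 1
        instance_id = f"i-{self._next:08x}"
        self.instances[instance_id] = {"keypair": keypair, "state": "pending"}
        return instance_id


class ResponseParseError(Exception):
    """Hypothetical error raised when the RunInstances reply cannot be read."""


def failing_vm_start(ec2, keypair):
    """Hypothetical faulty gahp: the instance starts, then the reply is lost."""
    ec2.run_instances(keypair)  # the instance now exists on the EC2 side
    raise ResponseParseError(
        "tag name or namespace mismatch in element <RunInstancesResponse>"
    )


ec2 = FakeEC2()
recorded_id = None  # what the gridmanager would have stored for the job
hold_reason = None
try:
    recorded_id = failing_vm_start(
        ec2, "SSH_127.0.0.1_robin.local#1043.0#1233413406"
    )
except ResponseParseError as err:
    hold_reason = str(err)  # the job goes on hold with this reason

# Instances that EC2 knows about but the caller never recorded.
leaked = [iid for iid in ec2.instances if iid != recorded_id]
```

After this runs, recorded_id is still None while leaked holds one pending instance, which is the state the job is in at the end of step 6.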

  
Expected results:

Upon release in step 7, the gridmanager should take into account all instances that map to the job to discover the proper state for the job. The job should be mapped to the instance created in step 6.
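
A minimal sketch of the lookup described above, assuming the gridmanager already has two maps: instance id to status, and instance id to keypair name. The function name and the dict layout are hypothetical, and treating only "terminated" as dead is a simplifying assumption. The ids and states are taken from the log below.

```python
def live_instances_for_job(job_keypair, statuses, keypairs):
    """Return ids of all non-terminated instances tagged with job_keypair.

    An instance whose status is unknown is kept as a candidate, so that a
    possibly running instance is never silently dropped.
    """
    candidates = [iid for iid, kp in keypairs.items() if kp == job_keypair]
    return [iid for iid in candidates if statuses.get(iid) != "terminated"]


job_keypair = "SSH_127.0.0.1_robin.local#1043.0#1233413406"

# Instance id -> status, as reported by AMAZON_VM_STATUS_ALL in the log.
statuses = {
    "i-9c901ff5": "terminated",
    "i-48af2021": "terminated",
    "i-24af204d": "pending",
}

# Instance id -> keypair, as reported by AMAZON_VM_RUNNING_KEYPAIR in the log.
keypairs = {iid: job_keypair for iid in statuses}

live = live_instances_for_job(job_keypair, statuses, keypairs)
```

Here live contains only i-24af204d, the instance from step 6. Picking the first id that matches the keypair gives i-9c901ff5 instead, which is what the log shows. If live ever held more than one id, the caller would still have to decide which instance to keep.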


Example log of a failure (from step 6 on):

1/31 12:35:33 [23016]    GridJobId = "amazon SSH_127.0.0.1_robin.local#1043.0#1233413406"
...
1/31 12:35:35 [23016] GAHP[23017] <- 'AMAZON_VM_START 4 
1/31 12:35:36 [23016] GAHP[23017] -> '4' '1' 'Client' 'Validation constraint violation: tag name or namespace mismatch in element <RunInstancesResponse>'
1/31 12:35:36 [23016] (1043.0) doEvaluateState called: gmState GM_START_VM, condorState 1
...
1/31 12:35:36 [23016] (1043.0) gm state change: GM_START_VM -> GM_HOLD
1/31 12:35:36 [23016] (1043.0) gm state change: GM_HOLD -> GM_DELETE
...
1/31 12:35:38 [23016]    HoldReason = "Validation constraint violation: tag name or namespace mismatch in element <RunInstancesResponse>"
...
1/31 12:35:58 [23050] GAHP[23053] <- 'AMAZON_VM_STATUS_ALL 2 ...'
1/31 12:35:58 [23050] GAHP[23053] -> '2' '0' 'i-9c901ff5' 'terminated' 'ami-cd39dca4' 'i-48af2021' 'terminated' 'ami-cd39dca4' 'i-24af204d' 'pending' 'ami-cd39dca4'
...
1/31 12:35:58 [23050] GAHP[23053] <- 'AMAZON_VM_RUNNING_KEYPAIR 3 ...'
...
1/31 12:35:59 [23050] GAHP[23053] -> '3' '0' 'i-9c901ff5' 'SSH_127.0.0.1_robin.local#1043.0#1233413406' 'i-48af2021' 'SSH_127.0.0.1_robin.local#1043.0#1233413406' 'i-24af204d' 'SSH_127.0.0.1_robin.local#1043.0#1233413406'
...
1/31 12:36:03 [23050]    GridJobId = "amazon SSH_127.0.0.1_robin.local#1043.0#1233413406 i-9c901ff5"
...
1/31 12:36:18 [23050] GAHP[23053] <- 'AMAZON_VM_STATUS 4 ... i-9c901ff5'
...
1/31 12:36:18 [23050] GAHP[23053] -> '4' '0' 'i-9c901ff5' 'terminated' 'ami-cd39dca4'
...
1/31 12:36:18 [23050] (1043.0) gm state change: GM_PROBE_JOB -> GM_SUBMITTED
1/31 12:36:18 [23050] (1043.0) gm state change: GM_SUBMITTED -> GM_DONE_SAVE
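
For anyone checking a similar log, the AMAZON_VM_STATUS_ALL reply can be split into per-instance records with a few lines of Python. This is a sketch based only on the reply format visible above; reading the first two quoted fields as request id and result code (with '0' meaning success) is an assumption drawn from the log, and parse_status_all is a hypothetical helper name.

```python
import re


def parse_status_all(reply):
    """Split a quoted reply into (instance_id, status, ami) triples."""
    fields = re.findall(r"'([^']*)'", reply)
    if len(fields) < 2:
        raise ValueError("reply too short")
    result_code, rest = fields[1], fields[2:]
    if result_code != "0":  # assumption: '0' marks a successful reply
        raise ValueError("request failed: " + " ".join(rest))
    if len(rest) % 3 != 0:
        raise ValueError("expected groups of id, status, ami")
    return [tuple(rest[i:i + 3]) for i in range(0, len(rest), 3)]


reply = (
    "'2' '0' 'i-9c901ff5' 'terminated' 'ami-cd39dca4' "
    "'i-48af2021' 'terminated' 'ami-cd39dca4' "
    "'i-24af204d' 'pending' 'ami-cd39dca4'"
)
records = parse_status_all(reply)
not_terminated = [iid for iid, status, _ami in records if status != "terminated"]
```

For the reply above, not_terminated contains only i-24af204d, the instance that the gridmanager never associates with the job.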