Bug 760279

Summary: ec2_gahp: transient error leads to hold jobs and leaked AMI instances
Product: Red Hat Enterprise MRG
Reporter: Luigi Toscano <ltoscano>
Component: condor
Assignee: grid-maint-list <grid-maint-list>
Status: CLOSED WONTFIX
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: low
Docs Contact:
Priority: low
Version: 2.1
CC: matt, tstclair
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-26 19:12:58 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Example of errors and job held (flags: none)
Gridmanager.<user> log for a released job (flags: none)

Description Luigi Toscano 2011-12-05 18:28:42 UTC
Description of problem:
When a transient error with the EC2 provider occurs, ec2_gahp should handle it more gracefully.
Example of errors (from Gridmanager.<user>):

12/03/11 01:13:15 [27737] (89.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (35): 'SSL connect error'.


12/04/11 05:40:47 [25661] (88.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (60): 'problem with the SSL CA cert (path? access rights?)'.


12/05/11 00:42:06 [16913] (82.0) job probe failed: E_HTTP_RESPONSE_NOT_200: <?xml version="1.0" encoding="UTF-8"?>\n<Response><Errors><Error><Code>InternalError</Code><Message>Request could not be executed due to an internal service error</Message></Error></Errors><RequestID>9d8865fb-b980-4fd8-81c7-6995ba301afe</RequestID></Response>



There are two different problems:
1) when the error occurs, the associated job is moved to "Hold" state, but the associated VM is still running (see attachment)

2) when the job in the "Hold" state from step 1 is released (condor_release), the job does not start again; it stays in the Idle state, and condor_q -better returns:
090.000:  Request has not yet been considered by the matchmaker.
even if ec2_gahp is started again. Maybe some field in the classad prevents a correct state transition? (GridJobStatus = "running")

ec2_gahp should avoid moving jobs to Hold for transient errors. Moreover, the transition hold->running should work, and the old instance should be killed at some point (or reattached, if it is possible to trust it again).
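
As an illustration of the requested behavior only, here is a minimal sketch of retrying a failed probe before giving up. It is not the ec2_gahp or gridmanager code: `probe_with_retries`, `TransientError`, the attempt count and the delays are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. an HTTP 500)."""


def probe_with_retries(
    probe: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call probe(); on TransientError wait and try again, doubling the delay.

    The last TransientError propagates once max_attempts is exhausted, which
    is the point where a caller could decide to put the job on hold.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return probe()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2
    raise ValueError("max_attempts must be at least 1")
```

With this shape a single 'SSL connect error' or InternalError response would cost one delayed retry instead of an immediate hold; any other exception still propagates on the first failure.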


Version-Release number of selected component (if applicable):
I have seen this behavior on RHEL5.7 only, but I am not sure if it was just a coincidence.
condor-7.6.5-0.8

Comment 1 Luigi Toscano 2011-12-05 18:29:26 UTC
Created attachment 541040 [details]
Example of errors and job held

Comment 2 Luigi Toscano 2011-12-05 18:36:32 UTC
Created attachment 541041 [details]
Gridmanager.<user> log for a released job

This is the Gridmanager log for two jobs which were moved to hold because of an error and released (condor_release) afterwards.

Comment 4 Matthew Farrellee 2011-12-05 23:03:33 UTC
EC2's Query API passes back appropriate HTTP error codes, e.g. 401 for AuthFailure, 500 for InternalError. Those codes can be used to determine if the error is fatal (e.g. client presenting invalid credentials) or non-fatal (e.g. internal server error, try again). Fatal errors should result in a job being held, non-fatal should not. The fatal/non-fatal determination can be made either in the ec2-gahp or in the gridmanager, preferred location TBD. ec2-gahp errors are currently ad-hoc.
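
A minimal sketch of that fatal/non-fatal split, for illustration only. The two examples above (401 fatal, 500 non-fatal) come from this comment; treating every 5xx as retryable and every other error status as fatal is an assumption of the sketch, and `Severity` and `classify_http_status` are hypothetical names, not part of ec2-gahp.

```python
from enum import Enum


class Severity(Enum):
    FATAL = "fatal"          # the job should be held
    TRANSIENT = "transient"  # try again, do not hold


def classify_http_status(status: int) -> Severity:
    """Map the HTTP status of a failed EC2 Query API call to a severity.

    Assumption: any 5xx is a server-side problem worth retrying, and any
    other non-2xx status (such as 401) is a client-side problem that a
    retry will not fix.
    """
    if 200 <= status <= 299:
        raise ValueError("status %d is not an error" % status)
    if 500 <= status <= 599:
        return Severity.TRANSIENT
    return Severity.FATAL
```

Wherever the decision ends up living, keeping it in one function like this would replace the ad-hoc error strings with a single place to adjust the policy.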

As for the Hold->Idle->Running transition, the correct approach is to take the instance down during the *->Hold transition.
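
The idea can be sketched as follows. This is a simplified model rather than gridmanager code: the `Job` fields, `hold_job` and the `terminate` callback are hypothetical stand-ins for the real job ad and the EC2 terminate call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    state: str                         # e.g. "Running", "Held", "Idle"
    instance_id: Optional[str] = None  # EC2 instance backing the job, if any
    hold_reason: str = ""


def hold_job(job: Job, reason: str, terminate: Callable[[str], None]) -> None:
    """Move job to Held, taking its instance down first so nothing is leaked."""
    if job.instance_id is not None:
        # If terminate() raises, the id is kept so the caller can retry later.
        terminate(job.instance_id)
        job.instance_id = None
    job.state = "Held"
    job.hold_reason = reason
```

Because the held job no longer references an instance, a later release can start from a clean Idle state instead of carrying stale status from the old instance.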

Comment 7 Timothy St. Clair 2012-03-07 21:25:12 UTC
*** Bug 783713 has been marked as a duplicate of this bug. ***

Comment 18 Anne-Louise Tangring 2016-05-26 19:12:58 UTC
MRG-G is in maintenance only and only customer escalations will be addressed from this point forward. This issue can be re-opened if a customer escalation associated with this issue occurs.