Description of problem:
When a transient error with the EC2 provider occurs, ec2_gahp should handle it more gracefully.

Example of errors (from Gridmanager.<user>):
12/03/11 01:13:15 [27737] (89.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (35): 'SSL connect error'.
12/04/11 05:40:47 [25661] (88.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (60): 'problem with the SSL CA cert (path? access rights?)'.
12/05/11 00:42:06 [16913] (82.0) job probe failed: E_HTTP_RESPONSE_NOT_200: <?xml version="1.0" encoding="UTF-8"?>\n<Response><Errors><Error><Code>InternalError</Code><Message>Request could not be executed due to an internal service error</Message></Error></Errors><RequestID>9d8865fb-b980-4fd8-81c7-6995ba301afe</RequestID></Response>

There are two different problems:
1) When the error occurs, the associated job is moved to the "Hold" state, but the associated VM is still running (see attachment).
2) When the job in the "Hold" state from step 1 is released (condor_release), the job does not start again; it stays in the Idle state, even if ec2_gahp is started again. condor_q -better returns:
090.000: Request has not yet been considered by the matchmaker.
Maybe some field in the classad prevents a correct state transition? (GridJobStatus = "running")

ec2_gahp should avoid moving jobs to "Hold" for transient errors. Moreover, the transition hold->running should work, and the old instance should be killed at some point (or reattached, if it is possible to trust it again).

Version-Release number of selected component (if applicable):
I have seen this behavior on RHEL5.7 only, but I am not sure if it was just a coincidence.
condor-7.6.5-0.8
Created attachment 541040 [details] Example of errors and job held
Created attachment 541041 [details] Gridmanager.<user> log for a released job
This is the Gridmanager log for two jobs which were moved to hold because of an error and released (condor_release) afterwards.
EC2's Query API passes back appropriate HTTP error codes, e.g. 401 for AuthFailure, 500 for InternalError. Those codes can be used to determine whether an error is fatal (e.g. the client presenting invalid credentials) or non-fatal (e.g. an internal server error, try again). Fatal errors should result in the job being held; non-fatal errors should not. The fatal/non-fatal determination can be made either in the ec2-gahp or in the gridmanager; preferred location TBD. ec2-gahp errors are currently ad hoc. As for Hold->Idle->Running, the correct approach is to take the instance down during the *->Hold transition.
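A rough sketch of the fatal/non-fatal split described above, in Python for brevity rather than in the actual ec2-gahp or gridmanager code. The names (Severity, classify_http_status, should_hold) and the retry limit are hypothetical, and treating every 5xx status as transient and every other non-200 status as fatal is an assumption that generalizes the two examples given (401 and 500).

```python
from enum import Enum


class Severity(Enum):
    FATAL = "fatal"          # retrying will not help; hold the job
    TRANSIENT = "transient"  # provider-side problem; retry later


def classify_http_status(status: int) -> Severity:
    """Classify a non-200 HTTP status returned by the EC2 Query API.

    Assumption: any 5xx status (e.g. 500 InternalError) is transient,
    and any other non-200 status (e.g. 401 AuthFailure) is fatal.
    """
    if 500 <= status <= 599:
        return Severity.TRANSIENT
    return Severity.FATAL


def should_hold(status: int, attempts: int, max_attempts: int = 3) -> bool:
    """Decide whether a failed request should put the job on hold.

    Fatal errors hold immediately. Transient errors hold only once the
    (hypothetical) retry budget max_attempts has been used up.
    """
    if classify_http_status(status) is Severity.FATAL:
        return True
    return attempts >= max_attempts
```

With this split, a single InternalError during a job probe would lead to a retry instead of a hold, while an AuthFailure would still hold the job on the first attempt. Curl-level failures such as the SSL errors in the description carry no HTTP status and would need a separate rule.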
*** Bug 783713 has been marked as a duplicate of this bug. ***
MRG-G is in maintenance only and only customer escalations will be addressed from this point forward. This issue can be re-opened if a customer escalation associated with this issue occurs.