Bug 760279

Summary:

ec2_gahp: transient error leads to hold jobs and leaked AMI instances

Product:

Red Hat Enterprise MRG

Reporter:

Luigi Toscano <ltoscano>

Component:

condor

Assignee:

grid-maint-list <grid-maint-list>

Status:

CLOSED WONTFIX

QA Contact:

MRG Quality Engineering <mrgqe-bugs>

Severity:

low

Docs Contact:

Priority:

low

Version:

2.1

CC:

matt, tstclair

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-05-26 19:12:58 UTC

Attachments:

Description	Flags
Example of errors and job held	none
Gridmanager.<user> log for a released job	none

Description Luigi Toscano 2011-12-05 18:28:42 UTC

Description of problem:
When a transient error with the EC2 provider occurs, ec2_gahp should handle it more gracefully.
Example of errors (from Gridmanager.<user>):

12/03/11 01:13:15 [27737] (89.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (35): 'SSL connect error'.


12/04/11 05:40:47 [25661] (88.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (60): 'problem with the SSL CA cert (path? access rights?)'.


12/05/11 00:42:06 [16913] (82.0) job probe failed: E_HTTP_RESPONSE_NOT_200: <?xml version="1.0" encoding="UTF-8"?>\n<Response><Errors><Error><Code>InternalError</Code><Message>Request could not be executed due to an internal service error</Message></Error></Errors><RequestID>9d8865fb-b980-4fd8-81c7-6995ba301afe</RequestID></Response>
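
The three log lines above share a common shape, so the error kind and its detail can be pulled out mechanically. The following is only a minimal sketch of that idea, not code from condor: the function name `parse_probe_failure` and the returned dictionary layout are hypothetical, and the regular expression is inferred solely from the sample lines in this report.

```python
import re
import xml.etree.ElementTree as ET

# Pattern inferred from the sample Gridmanager lines in this report (assumption).
PROBE_FAILURE = re.compile(
    r"job probe failed: (?P<kind>E_[A-Z0-9_]+): (?P<detail>.*)$"
)


def parse_probe_failure(line):
    """Hypothetical helper: extract the error kind and code from one log line.

    Returns None if the line is not a "job probe failed" line.
    """
    match = PROBE_FAILURE.search(line.strip())
    if match is None:
        return None

    kind = match.group("kind")
    detail = match.group("detail")
    result = {"kind": kind, "curl_code": None, "ec2_code": None}

    if kind == "E_CURL_IO":
        # e.g. "curl_easy_perform() failed (35): 'SSL connect error'."
        code = re.search(r"failed \((\d+)\)", detail)
        if code is not None:
            result["curl_code"] = int(code.group(1))
    elif kind == "E_HTTP_RESPONSE_NOT_200":
        # The log shows a literal backslash-n after the XML declaration;
        # turn it into a real newline before parsing.
        xml_text = detail.replace("\\n", "\n")
        try:
            root = ET.fromstring(xml_text.encode("utf-8"))
        except ET.ParseError:
            return result
        code = root.find("./Errors/Error/Code")
        if code is not None:
            result["ec2_code"] = code.text

    return result
```

For the third sample line, the sketch would report the kind `E_HTTP_RESPONSE_NOT_200` together with the EC2 error code taken from the `<Code>` element of the response body.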



There are two different problems:
1) When the error occurs, the associated job is moved to the "Hold" state, but the associated VM keeps running (see attachment).

2) When the held job from step 1 is released (condor_release), it does not start again; it stays in the Idle state, and condor_q -better returns:
090.000:  Request has not yet been considered by the matchmaker.
even if ec2_gahp is started again. Maybe some field in the classad prevents a correct state transition? (GridJobStatus = "running")

ec2_gahp should avoid moving jobs to Hold for transient errors. Moreover, the Hold->Running transition should work, and the old instance should be killed at some point (or reattached, if it can be trusted again).


Version-Release number of selected component (if applicable):
I have seen this behavior on RHEL5.7 only, but I am not sure if it was just a coincidence.
condor-7.6.5-0.8

Comment 1 Luigi Toscano 2011-12-05 18:29:26 UTC

Created attachment 541040 [details]
Example of errors and job held

Comment 2 Luigi Toscano 2011-12-05 18:36:32 UTC

Created attachment 541041 [details]
Gridmanager.<user> log for a released job

This is the Gridmanager log for two jobs which were moved to hold because of an error and released (condor_release) afterwards.

Comment 4 Matthew Farrellee 2011-12-05 23:03:33 UTC

EC2's Query API passes back appropriate HTTP error codes, e.g. 401 for AuthFailure, 500 for InternalError. Those codes can be used to determine whether the error is fatal (e.g. the client presented invalid credentials) or non-fatal (e.g. internal server error, try again). Fatal errors should result in the job being held; non-fatal errors should not. The fatal/non-fatal determination can be made either in the ec2-gahp or in the gridmanager; the preferred location is TBD. ec2-gahp errors are currently ad hoc.

As for the Hold->Idle->Running transition, the correct approach is to take the instance down during the *->Hold transition.
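
The policy described in this comment can be sketched roughly as follows. This is an illustration only, not condor or ec2-gahp code: `Action`, `next_action`, `hold_job`, the `max_attempts` retry budget, and the dictionary used to represent a job are all hypothetical. Treating every 5xx status as retryable is an assumption generalized from the 500 InternalError example above.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"  # non-fatal: probe again later
    HOLD = "hold"    # fatal: hold the job


def next_action(http_status, attempts, max_attempts=5):
    """Decide what to do after a failed EC2 request (hypothetical policy).

    5xx responses (e.g. 500 InternalError) are treated as non-fatal and
    retried until the retry budget is used up. Everything else
    (e.g. 401 AuthFailure) is treated as fatal.
    """
    if 500 <= http_status <= 599 and attempts < max_attempts:
        return Action.RETRY
    return Action.HOLD


def hold_job(job, terminate_instance):
    """Take the instance down as part of the *->Hold transition.

    `job` is a plain dict standing in for the job record, and
    `terminate_instance` is a caller-supplied callable (both assumptions).
    """
    instance_id = job.get("instance_id")
    if instance_id is not None:
        terminate_instance(instance_id)
        job["instance_id"] = None  # nothing left to leak or reattach
    job["status"] = "Held"
    return job


if __name__ == "__main__":
    terminated = []
    job = {"id": "89.0", "instance_id": "i-example", "status": "Running"}
    if next_action(401, attempts=0) is Action.HOLD:
        hold_job(job, terminated.append)
    print(job, terminated)
```

Clearing the instance reference during the hold step means a later condor_release starts from a clean state instead of depending on an instance that may still be running.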

Comment 7 Timothy St. Clair 2012-03-07 21:25:12 UTC

*** Bug 783713 has been marked as a duplicate of this bug. ***

Comment 18 Anne-Louise Tangring 2016-05-26 19:12:58 UTC

MRG-G is in maintenance only and only customer escalations will be addressed from this point forward. This issue can be re-opened if a customer escalation associated with this issue occurs.