Bug 783713 - Transient ec2_gahp E_CURL_IO failure -> job hold
Summary: Transient ec2_gahp E_CURL_IO failure -> job hold
Keywords:
Status: CLOSED DUPLICATE of bug 760279
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.2
Assignee: Timothy St. Clair
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-01-21 22:42 UTC by Matthew Farrellee
Modified: 2012-04-03 15:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-03-07 21:25:11 UTC
Target Upstream Version:



Description Matthew Farrellee 2012-01-21 22:42:52 UTC
Description of problem:

Note - this occurred with the idempotent ec2_gahp and gridmanager, but I have no reason to believe it would behave differently for stock versions.

An EC2 instance job was put on hold because of a curl I/O error, while the instance itself remains alive. The error should be treated as transient, and a retry should occur before the job is put on hold.

It is a bad state to have the instance running and the job held.

On release the job should reconnect and have an accurate run-time.
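
The retry-before-hold behavior requested above could look roughly like the sketch below. It is an illustration only, not the gridmanager's or ec2_gahp's actual code: ProbeError, probe_with_retry, the attempt count, the backoff delays, and the set of codes treated as transient are all assumptions made for this example. The only detail taken from the logs is libcurl error 7 (CURLE_COULDNT_CONNECT).

```python
import time

# libcurl codes treated as transient in this sketch. This set is an
# assumption, not Condor policy: 6 = CURLE_COULDNT_RESOLVE_HOST,
# 7 = CURLE_COULDNT_CONNECT (the code in the logs), 28 = CURLE_OPERATION_TIMEDOUT.
TRANSIENT_CURL_CODES = {6, 7, 28}


class ProbeError(Exception):
    """Hypothetical error carrying the libcurl code of a failed job probe."""

    def __init__(self, curl_code, message):
        super().__init__(message)
        self.curl_code = curl_code


def probe_with_retry(probe, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call probe(), retrying transient failures with exponential backoff.

    Returns ("ok", result) on success, or ("hold", reason) when the error
    is not transient or all attempts have been used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return ("ok", probe())
        except ProbeError as err:
            transient = err.curl_code in TRANSIENT_CURL_CODES
            if not transient or attempt == max_attempts:
                # Only now would the job be put on hold.
                return ("hold", str(err))
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    return ("hold", "no probe attempts were made")
```

With a policy like this, a single "Couldn't connect to server" response would lead to a later re-probe rather than an immediate hold, and the hold would only happen once the failure persists across all attempts.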


How reproducible:

Unknown


Actual results:

GridmanagerLog.matt -
01/21/12 17:09:54 [32348] (183.0) doEvaluateState called: gmState GM_SUBMITTED, condorState 2
01/21/12 17:10:15 [32348] (183.0) doEvaluateState called: gmState GM_PROBE_JOB, condorState 2
01/21/12 17:10:15 [32348] (183.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 [32348] No jobs left, shutting down
01/21/12 17:10:15 [32348] Got SIGTERM. Performing graceful shutdown.
01/21/12 17:10:15 [32348] **** condor_gridmanager (condor_GRIDMANAGER) pid 32348 EXITING WITH STATUS 0

EC2GahpLog.matt -
01/21/12 17:10:15 curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 EOF reached on pipe 0
01/21/12 17:10:15 stdin buffer closed, exiting

$ condor_q -l | grep HoldReason
HoldReasonSubCode = 0
HoldReason = "curl_easy_perform() failed (7): 'Couldn't connect to server'."
HoldReasonCode = 0

Comment 1 Luigi Toscano 2012-01-23 09:05:15 UTC
Could this bug be a potential duplicate of rhbz760279?

Comment 2 Matthew Farrellee 2012-01-25 19:13:18 UTC
(In reply to comment #1)
> Could this bug be a potential duplicate of rhbz760279?

Or at least both may indicate a deeper issue with error propagation.

Comment 3 Luigi Toscano 2012-03-07 18:49:38 UTC
I see that rhbz760279 is MODIFIED, but this bug has not been changed. Does that mean they are different?

Comment 4 Timothy St. Clair 2012-03-07 21:25:11 UTC
I'm going to mark *this as a dup and we will use the other for tracking.

*** This bug has been marked as a duplicate of bug 760279 ***

