Bug 783713

Summary: Transient ec2_gahp E_CURL_IO failure -> job hold
Product: Red Hat Enterprise MRG Reporter: Matthew Farrellee <matt>
Component: condor Assignee: Timothy St. Clair <tstclair>
Status: CLOSED DUPLICATE QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: Development CC: ltoscano, matt, tstclair
Target Milestone: 2.2   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-03-07 21:25:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Matthew Farrellee 2012-01-21 22:42:52 UTC
Description of problem:

Note - this occurred with the idempotent ec2_gahp and gridmanager, but I have no reason to believe it would behave differently for stock versions.

An EC2 instance job was put on hold because of a curl I/O error. The instance remains alive. The error should be considered transient, and a retry should occur before the job is put on hold.

It is a bad state to have the instance running and the job held.

On release the job should reconnect and have an accurate run-time.
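
For illustration only, the requested behavior (retry a transient probe failure before holding the job) could be sketched as below. This is not the ec2_gahp or gridmanager code: `CurlIOError`, `probe_with_retry`, the retry count, the backoff delays, and the choice of which libcurl codes count as transient are all assumptions made for the sketch. Only the numeric libcurl codes themselves (6, 7, 28) are real.

```python
import time

# libcurl result codes treated as transient in this sketch (an assumed policy):
#   6 = CURLE_COULDNT_RESOLVE_HOST
#   7 = CURLE_COULDNT_CONNECT (the code seen in this report)
#  28 = CURLE_OPERATION_TIMEDOUT
TRANSIENT_CURL_CODES = {6, 7, 28}


class CurlIOError(Exception):
    """Hypothetical stand-in for an E_CURL_IO failure reported by the gahp."""

    def __init__(self, code, message):
        super().__init__(
            f"E_CURL_IO: curl_easy_perform() failed ({code}): '{message}'"
        )
        self.code = code


def probe_with_retry(probe, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call probe(); retry transient curl failures with exponential backoff.

    Re-raises the error (so the caller can hold the job) only when the
    failure is not transient or all attempts have been used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return probe()
        except CurlIOError as err:
            if err.code not in TRANSIENT_CURL_CODES or attempt == max_attempts:
                raise
            # Back off before the next attempt: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With a policy like this, a single 'Couldn't connect to server' response would delay the probe rather than hold the job while the instance keeps running.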


How reproducible:

Unknown


Actual results:

GridmanagerLog.matt -
01/21/12 17:09:54 [32348] (183.0) doEvaluateState called: gmState GM_SUBMITTED, condorState 2
01/21/12 17:10:15 [32348] (183.0) doEvaluateState called: gmState GM_PROBE_JOB, condorState 2
01/21/12 17:10:15 [32348] (183.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 [32348] No jobs left, shutting down
01/21/12 17:10:15 [32348] Got SIGTERM. Performing graceful shutdown.
01/21/12 17:10:15 [32348] **** condor_gridmanager (condor_GRIDMANAGER) pid 32348 EXITING WITH STATUS 0

EC2GahpLog.matt -
01/21/12 17:10:15 curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 EOF reached on pipe 0
01/21/12 17:10:15 stdin buffer closed, exiting

$ condor_q -l | grep HoldReason
HoldReasonSubCode = 0
HoldReason = "curl_easy_perform() failed (7): 'Couldn't connect to server'."
HoldReasonCode = 0

Comment 1 Luigi Toscano 2012-01-23 09:05:15 UTC
Could this bug be a potential duplicate of rhbz760279?

Comment 2 Matthew Farrellee 2012-01-25 19:13:18 UTC
(In reply to comment #1)
> Could this bug be a potential duplicate of rhbz760279?

Or at least both may indicate a deeper issue with error propagation.

Comment 3 Luigi Toscano 2012-03-07 18:49:38 UTC
I see that rhbz760279 is MODIFIED, but this bug has not been changed. Does that mean they are different?

Comment 4 Timothy St. Clair 2012-03-07 21:25:11 UTC
I'm going to mark *this as a dup and we will use the other for tracking.

*** This bug has been marked as a duplicate of bug 760279 ***