Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 783713

Summary:	Transient ec2_gahp E_CURL_IO failure -> job hold
Product:	Red Hat Enterprise MRG	Reporter:	Matthew Farrellee <matt>
Component:	condor	Assignee:	Timothy St. Clair <tstclair>
Status:	CLOSED DUPLICATE	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	Development	CC:	ltoscano, matt, tstclair
Target Milestone:	2.2
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-03-07 21:25:11 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Matthew Farrellee 2012-01-21 22:42:52 UTC

Description of problem:

Note - this occurred with the idempotent ec2_gahp and gridmanager, but I have no reason to believe it would behave differently for stock versions.

EC2 instance job became held because of a curl io error. The instance remains alive. The error should be considered transient and a retry should occur before putting the job on hold.

It is a bad state to have the instance running and the job held.

On release the job should reconnect and have an accurate run-time.


How reproducible:

Unknown


Actual results:

GridmanagerLog.matt -
01/21/12 17:09:54 [32348] (183.0) doEvaluateState called: gmState GM_SUBMITTED, condorState 2
01/21/12 17:10:15 [32348] (183.0) doEvaluateState called: gmState GM_PROBE_JOB, condorState 2
01/21/12 17:10:15 [32348] (183.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 [32348] No jobs left, shutting down
01/21/12 17:10:15 [32348] Got SIGTERM. Performing graceful shutdown.
01/21/12 17:10:15 [32348] **** condor_gridmanager (condor_GRIDMANAGER) pid 32348 EXITING WITH STATUS 0

EC2GahpLog.matt -
01/21/12 17:10:15 curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 EOF reached on pipe 0
01/21/12 17:10:15 stdin buffer closed, exiting

$ condor_q -l | grep HoldReason                     
HoldReasonSubCode = 0
HoldReason = "curl_easy_perform() failed (7): 'Couldn't connect to server'."
HoldReasonCode = 0

Comment 1 Luigi Toscano 2012-01-23 09:05:15 UTC

Could this bug a potential duplicate of rhbz760279 ?

Comment 2 Matthew Farrellee 2012-01-25 19:13:18 UTC

(In reply to comment #1)
> Could this bug a potential duplicate of rhbz760279 ?

Or at least both may indicate a deeper issue with error propagation.

Comment 3 Luigi Toscano 2012-03-07 18:49:38 UTC

I see that rhbz760279 is MODIFIED, this bug has not been changed. Does it mean that are different?

Comment 4 Timothy St. Clair 2012-03-07 21:25:11 UTC

I'm going to mark *this as a dup and we will use the other for tracking.

*** This bug has been marked as a duplicate of bug 760279 ***