Description of problem:
Note: this occurred with the idempotent ec2_gahp and gridmanager, but I have no reason to believe it would behave differently for the stock versions.

An EC2 instance job became held because of a curl I/O error, while the instance itself remained alive. The error should be treated as transient, and a retry should occur before the job is put on hold. Having the instance running while the job is held is a bad state. On release, the job should reconnect and report an accurate run time.

How reproducible:
Unknown

Actual results:

GridmanagerLog.matt -

01/21/12 17:09:54 [32348] (183.0) doEvaluateState called: gmState GM_SUBMITTED, condorState 2
01/21/12 17:10:15 [32348] (183.0) doEvaluateState called: gmState GM_PROBE_JOB, condorState 2
01/21/12 17:10:15 [32348] (183.0) job probe failed: E_CURL_IO: curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 [32348] No jobs left, shutting down
01/21/12 17:10:15 [32348] Got SIGTERM. Performing graceful shutdown.
01/21/12 17:10:15 [32348] **** condor_gridmanager (condor_GRIDMANAGER) pid 32348 EXITING WITH STATUS 0

EC2GahpLog.matt -

01/21/12 17:10:15 curl_easy_perform() failed (7): 'Couldn't connect to server'.
01/21/12 17:10:15 EOF reached on pipe 0
01/21/12 17:10:15 stdin buffer closed, exiting

$ condor_q -l | grep HoldReason
HoldReasonSubCode = 0
HoldReason = "curl_easy_perform() failed (7): 'Couldn't connect to server'."
HoldReasonCode = 0
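For reference, below is a minimal sketch (not the actual gridmanager or ec2_gahp code) of the retry-before-hold behavior being requested. The probe_instance() helper and the endpoint URL are hypothetical; the point is that connection-level libcurl failures such as error 7 (CURLE_COULDNT_CONNECT) are retried with backoff, and only after the retries are exhausted would the job be put on hold.

// Sketch only; helper names and URL are placeholders, not HTCondor APIs.
// Build with: g++ retry_sketch.cpp -lcurl
#include <curl/curl.h>
#include <chrono>
#include <thread>
#include <cstdio>

// Treat connection-level failures as transient; anything else is fatal.
static bool is_transient(CURLcode rc) {
    switch (rc) {
        case CURLE_COULDNT_CONNECT:       // curl error 7, as seen in the log
        case CURLE_COULDNT_RESOLVE_HOST:
        case CURLE_OPERATION_TIMEDOUT:
            return true;
        default:
            return false;
    }
}

// Hypothetical probe: one EC2 status query performed with libcurl.
static CURLcode probe_instance(const char *url) {
    CURL *h = curl_easy_init();
    if (!h) return CURLE_FAILED_INIT;
    curl_easy_setopt(h, CURLOPT_URL, url);
    curl_easy_setopt(h, CURLOPT_TIMEOUT, 30L);
    CURLcode rc = curl_easy_perform(h);
    curl_easy_cleanup(h);
    return rc;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    const char *url = "https://ec2.example.invalid/";  // placeholder endpoint
    const int max_attempts = 5;

    CURLcode rc = CURLE_OK;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        rc = probe_instance(url);
        if (rc == CURLE_OK || !is_transient(rc))
            break;
        // Back off and retry instead of holding the job on the first failure.
        std::fprintf(stderr, "probe attempt %d failed (%d): %s; retrying\n",
                     attempt, rc, curl_easy_strerror(rc));
        std::this_thread::sleep_for(std::chrono::seconds(10 * attempt));
    }

    if (rc != CURLE_OK) {
        // Only at this point would the job be placed on hold.
        std::fprintf(stderr, "giving up: %s\n", curl_easy_strerror(rc));
    }
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}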
Could this bug be a potential duplicate of rhbz760279?
(In reply to comment #1)
> Could this bug be a potential duplicate of rhbz760279?

Or, at the least, both may indicate a deeper issue with error propagation.
I see that rhbz760279 is MODIFIED, while this bug has not changed state. Does that mean they are different issues?
I'm going to mark this one as a duplicate and we will use the other for tracking.

*** This bug has been marked as a duplicate of bug 760279 ***