480841 – When an EC2e job fails it continues to be in the job queue indefinitely.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	grid
Sub Component:
Version:	1.1
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.1.1
Target Release:	---
Assignee:	Robert Rati
QA Contact:	Jeff Needle
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-01-20 20:31 UTC by William Henry
Modified:	2009-04-21 16:18 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-04-21 16:18:23 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2009:0434	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging and Grid Version 1.1.1	2009-04-21 16:15:50 UTC

Description William Henry 2009-01-20 20:31:58 UTC

Description of problem:

Discussion between William, Matt and Rob:

William:
What would be the expected result of sending a job into EC2E where transfer_executable = false and the executable doesn't exist in the image?

I'd have expected caroniad to throw some sort of error and the job to fail. Instead the job continues to run and shows little interest in stopping after 30 min.

Matt:
We need an error signal, but that didn't make it for 1.1.

Right now the expected result would be that the job keeps trying, but
never works. This is not necessarily an issue for pools with local
resources and EC2 resources using EC2e, because the job can also be
tried on local resources, where it might also fail of course.

William:
"job keeps trying" do you mean it fails and gets resubmitted or that
ec2e in the AMI instance just keeps trying it and therefore the job just
continues running in the submit node queue? 'Cause that's what I was seeing.

Matt:
It fails and gets resubmitted. If that's not what's happening there's a bug.

Do you know that the job is attempted multiple times by the instance?

Could it be that the instance considers a job that doesn't run to be like
not receiving a work message at all and just keeps waiting for one,
eventually timing out?

Honestly, if the instance is tasked with processing one message, and it
gets a message and the job fails, it should shut down with an error;
another message isn't coming.

Rob:
The AMI will attempt to run a job, and if it fails it will enable the
AMI to attempt to run it again. There is a mechanism to shut down the
AMI if no job is run, but it will basically be a race between that
activity check and the next request from condor to process work.

If the AMI is shut down, the job will be re-routed by condor and a new AMI
started. This will keep happening as long as the job isn't run and is in
the schedd's queue, so it seems like a different side of the same coin
to me. At least with a long-running AMI, you have a hint something
might be wrong. If the AMI is constantly shut down, the running time
will be reset for each AMI and it might be more difficult to recognize
there's a problem.

If we want to change things to shut down the AMI, that's fine with me.
It's an easy change; however, I don't know that it's worth changing for 1.1
because of the overhead involved with new packaging (re-testing, errata, etc).

I would think the correct fix for this is to put the job on hold if it
fails to run, with an appropriate hold reason, if we can get/create one.
That would seem to be more consistent with condor's philosophy.

A way to test this:
As per the original email from whenry: submit a job that needs an application that is not in the AMI. This will cause the condition.

Comment 1 Robert Rati 2009-02-04 23:04:11 UTC

A new attribute needs to be added to an EC2E job to control this.  Submit files should include:

+EC2RunAttempts = 0

EC2RunAttempts can be used in periodic expressions that are evaluated by the Job Router on the source job.  This attribute is updated each time the AMI attempts to run the job.
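
The interaction between the attempt counter and a periodic hold expression can be illustrated with a toy model. This is only a sketch in Python, not Condor code: the SourceJob class and the helper functions are hypothetical names invented for illustration, the threshold of 5 is an arbitrary example, and the real evaluation is done by Condor on its own schedule rather than before every attempt.

```python
from dataclasses import dataclass, field


@dataclass
class SourceJob:
    """Toy stand-in for the source job's attributes (hypothetical, not Condor code)."""
    attrs: dict = field(default_factory=lambda: {"EC2RunAttempts": 0})
    held: bool = False
    hold_reason: str = ""


def record_attempt(job):
    # Models the attribute being updated each time a run is attempted in the AMI.
    job.attrs["EC2RunAttempts"] += 1


def evaluate_periodic_hold(job, max_attempts):
    # Models a periodic hold expression of the form 'EC2RunAttempts >= N'.
    if not job.held and job.attrs["EC2RunAttempts"] >= max_attempts:
        job.held = True
        job.hold_reason = f"'EC2RunAttempts >= {max_attempts}' evaluated to TRUE"
    return job.held


def run_until_held(job, max_attempts, attempt_succeeds, limit=100):
    """Retry until the job succeeds or is held; 'limit' only bounds the toy loop."""
    for _ in range(limit):
        if evaluate_periodic_hold(job, max_attempts):
            return "held"
        record_attempt(job)
        if attempt_succeeds():
            return "completed"
    return "still queued"


# A job whose executable is missing never succeeds, so it ends up on hold
# instead of being retried indefinitely.
job = SourceJob()
outcome = run_until_held(job, max_attempts=5, attempt_succeeds=lambda: False)
print(outcome, job.attrs["EC2RunAttempts"])
```

Without the counter, the same loop would have no condition to stop on, which is the behaviour originally reported.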

Fixed in:
condor-7.2.1-0.4
condor-ec2-enhanced-hooks-1.0-11
condor-remote-configuration-server-1.0-12

If the condor-remote-configuration-server package isn't used, the following needs to be added to condor's configuration:

JOB_ROUTER_ATTRS_TO_COPY = EC2RunAttempts, EC2JobSuccessful

Comment 2 Robert Rati 2009-02-04 23:23:02 UTC

Additional package:
condor-ec2-enhanced-1.0-8

Comment 4 Jan Sarenik 2009-03-11 16:09:40 UTC

Works as expected!

  Testing methodology
 ---------------------

I made a job.submit file that did not transfer files and tried to execute
a shell script that does not exist on the remote machine (although it has to
exist locally).

After some time 'condor_q -l <source_job>' contained the following:
----------------------------------------------------------------
EC2RunAttempts = 10
HoldReason = "The job attribute PeriodicHold expression 'EC2RunAttempts >= 5' evaluated to TRUE"
----------------------------------------------------------------
... the same values can be found below as part of the condor_history output.


Then, after manually removing both the dest and source jobs:

# condor_history -l <source_job> | grep EC
--------------------------------------------------------------------
PeriodicHold = EC2RunAttempts >= 5
EC2RunAttempts = 10
LastHoldReason = "The job attribute PeriodicHold expression 'EC2RunAttempts >= 5' evaluated to TRUE"
--------------------------------------------------------------------

# condor_history -l <dest_job> | grep EC
--------------------------------------------------------------------
Cmd = "EC2: Amazon Small: /mnt/sharedfs/testmonkey/north-14/ec2e/jasan2/jasan.sh"
EC2RunAttempts = 11
--------------------------------------------------------------------
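
A check like the one above could be scripted rather than done by eye. The following is a minimal sketch, assuming the long listing consists of simple "Name = value" lines as in the output shown here; parse_long_listing is a hypothetical helper written for illustration, not part of Condor, and the sample text is copied from the source job's history output above.

```python
def parse_long_listing(text):
    """Parse 'Name = value' lines into a dict of raw string values."""
    attrs = {}
    for line in text.splitlines():
        # Split on the first ' = ' only, so values containing '>=' stay intact.
        name, sep, value = line.partition(" = ")
        if sep:
            attrs[name.strip()] = value.strip()
    return attrs


# Sample taken from the condor_history output for the source job.
sample = '''PeriodicHold = EC2RunAttempts >= 5
EC2RunAttempts = 10
LastHoldReason = "The job attribute PeriodicHold expression 'EC2RunAttempts >= 5' evaluated to TRUE"
'''

attrs = parse_long_listing(sample)
attempts = int(attrs["EC2RunAttempts"])
was_held = "LastHoldReason" in attrs
print(attempts, was_held)
```

In practice the text would come from the command output itself instead of an inline sample.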


-------------------------------
| job.submit file follows
----------------------------------------
universe = vanilla
executable = /mnt/sharedfs/testmonkey/north-14/ec2e/jasan2/jasan.sh
output = stdout.$(PROCESS)
error = /tmp/job.stderr.$(PROCESS)
requirements = Arch == "INTEL"
log = ulog.$(CLUSTER).$(PROCESS)
should_transfer_files = false
periodic_hold = EC2RunAttempts >= 5
+WantAWS = True
+WantArch = "INTEL"
+WantCPUs = 1
+AmazonKeyPairFile = "/tmp/keypair-$(PROCESS)"
+EC2RunAttempts = 0
queue 1
-------------------------------------------------------------

Comment 6 errata-xmlrpc 2009-04-21 16:18:23 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html
