Description of problem: Discussion between William, Matt and Rob: William: What would the expected result of sending a job into EC2E where transfer_executable = false and the executable doesn't exist in the image? I'd have expected caroniad to throw some sort of error and the job to fail. Instead the job continues to run and shows little interest in stopping after 30 min. Matt: We need an error signal, but that didn't make it for 1.1. Right now the expected result would be that the job keeps trying, but never works. This is not necessarily an issue for pools with local resources and EC2 resources using EC2e, because the job can also be tried on local resources, where it might also fail of course. William: "job keeps trying" do you mean it fails and gets resubmitted or that ec2e in the AMI instance just keeps trying it and therefore the job just continues running in the submit node queue? 'Cause that's what I was seeing. Matt: It fails and gets resubmitted. If that's not what's happening there's a bug. Do you know that the job is attempted multiple times by the instance? Could it be the instances considers a job that doesn't run to be like not receiving a work message at all and just keeps waiting for one, eventually timing out? Honestly, if the instance is tasked with processing one message, if it gets a message and the job fails it should shutdown with an error, another message isn't coming. Rob: The AMI will attempt to run a job, and if it fails it will enable the AMI to attempt to run it again. There is a mechanism to shut down the AMI if no job is run, but it will basically be a race between that activity check and the next request from condor to process work. If the AMI is shutdown, it will be re-routed by condor and a new AMI started. This will keep happening as long as the job isn't run and in the schedd's queue, so it seems like a different side of the same coin to me. At least with a long running AMI, you have a hint something might be wrong. If the AMI is constantly shutdown, the running time will be reset for each AMI and it might be more difficult to recognize there's a problem. If we want to change things to shutdown the AMI, that's fine with me. It's an easy change, however I'm don't know it's worth changing for 1.1 because of overhead involved with new packaging (re-testing, errata, etc). I would think the correct fix for this to put the job on hold if it fails to run with an appropriate hold reason, if we can get/create one. That would seem to be more consistent with condor's philosophy. A Way to Test this: As per the original email from whenry: submit a job that needs an applicaiton that is not in the AMI. This will cause the condition.
A new attribute needs to be added to an EC2E job to control this. Submit files should include: +EC2RunAttempts = 0 EC2RunAttempts can be used in periodic expressions that are evaluated by the Job Router on the source job. This attribute is updated each time the job is attempted to be run in the AMI. Fixed in: condor-7.2.1-0.4 condor-ec2-enhanced-hooks-1.0-11 condor-remote-configuration-server-1.0-12 In the case the condor-remote-configuration-server package isn't used, the following needs to be added to condor's configuration: JOB_ROUTER_ATTRS_TO_COPY = EC2RunAttempts, EC2JobSuccessful
Additional package: condor-ec2-enhanced-1.0-8
Works as expected! Testing methodology --------------------- I made a job.submit file that did not transfer files and tried to execute shellscript which does not exist on remote machine (although has to be there locally). After some time 'condor_q -l <source_job>' contained following: ---------------------------------------------------------------- EC2RunAttempts = 10 HoldReason = "The job attribute PeriodicHold expression 'EC2RunAttempts >= 5' evaluated to TRUE" ---------------------------------------------------------------- ... that can be found below as part of condor_history output. Then after manually removing both dest and source job: # condor_history -l <source_job> | grep EC -------------------------------------------------------------------- PeriodicHold = EC2RunAttempts >= 5 EC2RunAttempts = 10 LastHoldReason = "The job attribute PeriodicHold expression 'EC2RunAttempts >= 5' evaluated to TRUE" -------------------------------------------------------------------- # condor_history -l <dest_job> | grep EC -------------------------------------------------------------------- Cmd = "EC2: Amazon Small: /mnt/sharedfs/testmonkey/north-14/ec2e/jasan2/jasan.sh" EC2RunAttempts = 11 -------------------------------------------------------------------- ------------------------------- | job.submit file follows ---------------------------------------- universe = vanilla executable = /mnt/sharedfs/testmonkey/north-14/ec2e/jasan2/jasan.sh output = stdout.$(PROCESS) error = /tmp/job.stderr.$(PROCESS) requirements = Arch == "INTEL" log = ulog.$(CLUSTER).$(PROCESS) should_transfer_files = false periodic_hold = EC2RunAttempts >= 5 +WantAWS = True +WantArch = "INTEL" +WantCPUs = 1 +AmazonKeyPairFile = "/tmp/keypair-$(PROCESS)" +EC2RunAttempts = 0 queue 1 -------------------------------------------------------------
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0434.html