Red Hat Bugzilla – Bug 474845
EC2E AMIs shutting down and restarting
Last modified: 2009-02-04 11:04:58 EST
Description of problem:
It's been observed a few times that EC2E AMIs would start and apparently run jobs, but the source jobs would move from R->C->R. The AMIs would shut down and then restart (possibly because the job is being re-routed?). When this occurs, the thrashing never seems to end. No jobs complete and AMIs keep being restarted.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
This is being caused by a failure in the finalization process. The exit hook is executed, and then condor fails to find a file it is expecting in the spool directory, so the finalization process fails and condor sets JobStatus = 1 (Idle). This causes the job to be re-routed and a new AMI to be started. The finalization process can't find the files it is looking for because the exit hook is being told that the spool directory is in a different location than it actually is. In the case I am seeing, a job is submitted from /home/testmonkey/ec2e and the routed job has:
SUBMIT_Iwd = "/home/testmonkey/ec2e"
Iwd = "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0"
The spool directory that the exit hook is being given is "/home/testmonkey/ec2e", not "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0". So the exit hook is extracting the tarball from S3 into "/home/testmonkey/ec2e" instead of the ...cluster3142... dir, and thus the finalization process can't find the files it needs to complete.
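To make the mismatch concrete, here is a minimal Python sketch of a check for it. It is not part of condor or of the EC2E hooks: parse_ad and is_misdirected are hypothetical helper names, and the parser only handles simple `Name = "value"` string attributes like the two shown above.

```python
import re

def parse_ad(text):
    """Parse simple `Name = "value"` lines into a dict (string attributes only)."""
    attrs = {}
    for line in text.splitlines():
        m = re.match(r'\s*(\w+)\s*=\s*"(.*)"\s*$', line)
        if m:
            attrs[m.group(1)] = m.group(2)
    return attrs

def is_misdirected(attrs, hook_dir):
    """True when the directory handed to the exit hook is not the routed job's Iwd."""
    return hook_dir != attrs["Iwd"]

# The two attributes quoted from the routed job in this report.
ROUTED_AD = '''SUBMIT_Iwd = "/home/testmonkey/ec2e"
Iwd = "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0"'''

attrs = parse_ad(ROUTED_AD)
# The failing case: the hook was given the submit directory, not the routed job's Iwd.
print(is_misdirected(attrs, "/home/testmonkey/ec2e"))
```

With the values from this report the check flags the directory given to the hook, since it equals SUBMIT_Iwd rather than Iwd.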
The jobs are no longer forcibly spooled by the hooks, and the finalize hook does the extracting/remapping of files back to the original job's iwd, as opposed to extracting the files from AWS into the routed job's spool directory.
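As a rough illustration of the new behaviour, the Python sketch below unpacks a results tarball directly into the original job's iwd. It is not the actual finalize hook: extract_results is a hypothetical name, and the sketch assumes the tarball has already been downloaded from S3 to a local path and that the caller knows the original job's iwd.

```python
import os
import tarfile

def extract_results(tarball_path, original_iwd):
    """Unpack a results tarball into the original job's iwd.

    Hypothetical helper for illustration only. Returns the names of the
    regular files that were extracted.
    """
    os.makedirs(original_iwd, exist_ok=True)
    with tarfile.open(tarball_path, "r:*") as tar:
        members = tar.getmembers()
        for m in members:
            # Refuse entries that would land outside the target directory.
            if os.path.isabs(m.name) or ".." in m.name.split("/"):
                raise ValueError("unsafe path in tarball: %s" % m.name)
        tar.extractall(path=original_iwd)
        return [m.name for m in members if m.isfile()]
```

Because the files go straight to the original job's iwd, nothing in this sketch depends on the routed job's spool directory.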
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.