Red Hat Bugzilla – Bug 451797
failure in one ec2 job can block other jobs
Last modified: 2008-11-04 10:00:51 EST
When an EC2 job fails for some reason, such as improper authentication, the
entire EC2 resource is marked as down, preventing other jobs from being run.
The EC2 resource is marked as down only if a AMAZON_VM_STATUS_ALL gahp command from the AmazonResource object in the gridmanager fails. There's a separate AmazonResource object for each set of key files used by jobs, so if a command fails, it should only affect other jobs using the same set of key files.
The issue may be the credentials used to service the AMAZON_VM_STATUS_ALL command. If they are somehow invalid, the resource will be marked as down unnecessarily. I unfortunately do not remember the specifics of the issue. A tests containing valid and invalid credentials submitted by a single user, or even a mix of users, may flush it out though. The invalid credentials should be tested as the first credentials available.
Verified this happens when a job is submitted with a bad cert, the grid resource is marked as down, the cert is then changed (cert file's contents!) and a new job is submitted using the same cert filename.
This is because the two jobs are considered to be using the same resource - same cert filename + other similarities.
If the second job uses a different file name for the cert it will not be held up by the failure from the first job.
This could also be addressed by fingerprinting the file in some way other than just its name. However, since this is an obscure issue that only occurs when modifying files that arguably shouldn't be modified during the execution of a job, I'm closing this as WONTFIX. It's also possible to argue that Condor should make its own copies of the files to avoid this issue more generally. This issue could be reopened if it becomes important enough.