Bug 451797 - failure in one ec2 job can block other jobs
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Hardware: All Linux
Priority: low, Severity: low
: 1.1
: ---
Assigned To: Matthew Farrellee
Kim van der Riet
Reported: 2008-06-17 09:40 EDT by Matthew Farrellee
Modified: 2008-11-04 10:00 EST

Doc Type: Bug Fix
Last Closed: 2008-11-04 10:00:51 EST

Attachments: None
Description Matthew Farrellee 2008-06-17 09:40:55 EDT
When an EC2 job fails for some reason, such as improper authentication, the
entire EC2 resource is marked as down, preventing other jobs from being run.
Comment 1 Jaime Frey 2008-09-10 16:01:17 EDT
The EC2 resource is marked as down only if an AMAZON_VM_STATUS_ALL gahp command from the AmazonResource object in the gridmanager fails. There's a separate AmazonResource object for each set of key files used by jobs, so if a command fails, it should only affect other jobs using the same set of key files.
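
To make the scoping described here concrete, the following is a minimal sketch, not the gridmanager's actual code. It assumes one resource object per (cert file, key file) pair; the class and method names (AmazonResourceSketch, ResourceRegistry, status_all_failed) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AmazonResourceSketch:
    """Hypothetical stand-in for the gridmanager's AmazonResource object."""
    key_files: tuple
    is_down: bool = False
    jobs: list = field(default_factory=list)


class ResourceRegistry:
    """Keeps one resource object per set of key files (assumed to be a pair)."""

    def __init__(self):
        self._resources = {}

    def resource_for(self, cert_file, key_file):
        # Jobs that name the same key files share one resource object.
        key = (cert_file, key_file)
        if key not in self._resources:
            self._resources[key] = AmazonResourceSketch(key_files=key)
        return self._resources[key]

    def submit(self, job_id, cert_file, key_file):
        resource = self.resource_for(cert_file, key_file)
        resource.jobs.append(job_id)
        return resource

    def status_all_failed(self, cert_file, key_file):
        # A failed status poll marks only this one resource as down.
        self.resource_for(cert_file, key_file).is_down = True

    def blocked_jobs(self):
        return [job for res in self._resources.values()
                if res.is_down for job in res.jobs]
```

In this sketch, a failed poll for one pair of key files blocks only the jobs attached to that pair; jobs submitted with a different pair keep their own, unaffected resource.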
Comment 2 Matthew Farrellee 2008-09-11 00:10:27 EDT
The issue may be the credentials used to service the AMAZON_VM_STATUS_ALL command. If they are somehow invalid, the resource will be marked as down unnecessarily. I unfortunately do not remember the specifics of the issue. A test containing valid and invalid credentials submitted by a single user, or even a mix of users, may flush it out, though. The invalid credentials should be tested as the first credentials available.
Comment 3 Matthew Farrellee 2008-10-15 16:59:32 EDT
Verified. This happens when a job is submitted with a bad cert and the grid resource is marked as down, then the cert is changed (the cert file's contents, not its name!) and a new job is submitted using the same cert filename.

This is because the two jobs are considered to be using the same resource - same cert filename + other similarities.

If the second job uses a different file name for the cert it will not be held up by the failure from the first job.
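
The sequence above can be sketched as a small simulation. This is a simplified model under stated assumptions, not Condor code: resource identity is reduced to the cert and key file paths only (the "other similarities" are ignored), and resource_id_by_name, write_file, and reproduce are hypothetical helper names.

```python
import os
import tempfile


def resource_id_by_name(cert_path, key_path):
    # Assumption: identity is derived from file names only, never contents.
    return (os.path.abspath(cert_path), os.path.abspath(key_path))


def write_file(path, text):
    with open(path, "w") as f:
        f.write(text)


def reproduce():
    """Return (same_name_runs, other_name_runs) for the second job."""
    down = set()  # identities of resources currently marked as down
    with tempfile.TemporaryDirectory() as d:
        cert = os.path.join(d, "cert.pem")
        key = os.path.join(d, "key.pem")
        write_file(cert, "bad cert")
        write_file(key, "private key")

        # First job fails with the bad cert; its resource is marked down.
        down.add(resource_id_by_name(cert, key))

        # The cert contents are fixed, but the file name stays the same.
        write_file(cert, "good cert")
        same_name_runs = resource_id_by_name(cert, key) not in down

        # A second job that uses a different cert file name gets a new identity.
        other_cert = os.path.join(d, "cert2.pem")
        write_file(other_cert, "good cert")
        other_name_runs = resource_id_by_name(other_cert, key) not in down
    return same_name_runs, other_name_runs
```

Under these assumptions, the job that reuses the file name stays blocked even though the contents are now valid, while the job that uses a new file name is not held up.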
Comment 4 Matthew Farrellee 2008-11-04 10:00:51 EST
This could also be addressed by fingerprinting the file in some way other than just its name. However, since this is an obscure issue that only occurs when modifying files that arguably shouldn't be modified during the execution of a job, I'm closing this as WONTFIX. It's also possible to argue that Condor should make its own copies of the files to avoid this issue more generally. This issue could be reopened if it becomes important enough.
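
As an illustration of the fingerprinting idea, here is a minimal sketch that derives the resource identity from file contents rather than file names. The choice of SHA-256 and the function names fingerprint and resource_id_by_content are assumptions made for this example; nothing in this bug specifies how Condor would implement it.

```python
import hashlib


def fingerprint(path, chunk_size=65536):
    """Return a hex digest of the file's contents (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resource_id_by_content(cert_path, key_path):
    # Identity changes whenever either file's contents change,
    # even if the file names stay the same.
    return (fingerprint(cert_path), fingerprint(key_path))
```

With an identity like this, replacing a bad cert in place would yield a new resource identity, so a job submitted afterwards would not inherit the down state. One trade-off is that the files must be re-read at submit time, and identical contents stored under different names would map to the same resource.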