Bug 451797

Summary: failure in one ec2 job can block other jobs
Product: Red Hat Enterprise MRG
Component: grid
Version: 1.0
Status: CLOSED WONTFIX
Severity: low
Priority: low
Hardware: All
OS: Linux
Target Milestone: 1.1
Reporter: Matthew Farrellee <matt>
Assignee: Matthew Farrellee <matt>
QA Contact: Kim van der Riet <kim.vdriet>
CC: jfrey
Doc Type: Bug Fix
Last Closed: 2008-11-04 15:00:51 UTC

Description Matthew Farrellee 2008-06-17 13:40:55 UTC
When an EC2 job fails for some reason, such as improper authentication, the
entire EC2 resource is marked as down, preventing other jobs from being run.

Comment 1 Jaime Frey 2008-09-10 20:01:17 UTC
The EC2 resource is marked as down only if an AMAZON_VM_STATUS_ALL gahp command from the AmazonResource object in the gridmanager fails. There's a separate AmazonResource object for each set of key files used by jobs, so if a command fails, it should only affect other jobs using the same set of key files.

Comment 2 Matthew Farrellee 2008-09-11 04:10:27 UTC
The issue may be the credentials used to service the AMAZON_VM_STATUS_ALL command. If they are somehow invalid, the resource will be marked as down unnecessarily. I unfortunately do not remember the specifics of the issue. A test containing valid and invalid credentials submitted by a single user, or even a mix of users, may flush it out though. The invalid credentials should be tested as the first credentials available.

Comment 3 Matthew Farrellee 2008-10-15 20:59:32 UTC
Verified that this happens in the following sequence: a job is submitted with a bad cert, the grid resource is marked as down, the cert is then changed (the cert file's contents, not its name), and a new job is submitted using the same cert filename.

This is because the two jobs are considered to be using the same resource - same cert filename + other similarities.

If the second job uses a different file name for the cert it will not be held up by the failure from the first job.
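
The behaviour described in this comment can be illustrated with a minimal Python sketch. It is not the gridmanager's code: `Resource`, `ResourceRegistry`, and the use of a (cert filename, key filename) tuple as the lookup key are hypothetical, assumed only from the observation above that jobs sharing a cert filename share a resource.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for a per-credential grid resource."""
    name: tuple
    down: bool = False


class ResourceRegistry:
    """Looks resources up by credential file names only (assumed behaviour)."""

    def __init__(self):
        self._resources = {}

    def lookup(self, cert_file, key_file):
        # The key ignores file contents, so editing a file in place
        # does not produce a new resource.
        key = (cert_file, key_file)
        if key not in self._resources:
            self._resources[key] = Resource(name=key)
        return self._resources[key]


registry = ResourceRegistry()

# Job 1 uses a bad cert; its status command fails and the resource goes down.
first = registry.lookup("cert.pem", "key.pem")
first.down = True

# The contents of cert.pem are then fixed on disk, but the name is unchanged,
# so job 2 lands on the same, still-down resource.
second = registry.lookup("cert.pem", "key.pem")

# Job 3 uses a different cert filename and gets a fresh resource.
third = registry.lookup("cert2.pem", "key.pem")

print(second.down, third.down)
```

In this sketch the second job inherits the down state purely because the lookup key is unchanged, while the third job is unaffected.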

Comment 4 Matthew Farrellee 2008-11-04 15:00:51 UTC
This could also be addressed by fingerprinting the file in some way other than just its name. However, since this is an obscure issue that only occurs when modifying files that arguably shouldn't be modified during the execution of a job, I'm closing this as WONTFIX. It's also possible to argue that Condor should make its own copies of the files to avoid this issue more generally. This issue could be reopened if it becomes important enough.
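
One way to read the fingerprinting idea is to include a hash of each file's contents in whatever key identifies the resource. The following Python sketch shows that under stated assumptions: `fingerprint` and `resource_key` are hypothetical helpers, SHA-256 is an arbitrary choice of hash, and nothing here reflects how Condor actually identifies resources.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Return a SHA-256 hex digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resource_key(cert_file, key_file):
    """Hypothetical lookup key: file names plus content fingerprints."""
    return (cert_file, fingerprint(cert_file), key_file, fingerprint(key_file))


with tempfile.TemporaryDirectory() as tmp:
    cert = os.path.join(tmp, "cert.pem")
    key = os.path.join(tmp, "key.pem")
    with open(cert, "w") as f:
        f.write("bad cert")
    with open(key, "w") as f:
        f.write("key")
    before = resource_key(cert, key)

    # Rewrite the cert in place: same filename, different contents.
    with open(cert, "w") as f:
        f.write("good cert")
    after = resource_key(cert, key)

# The keys differ, so the corrected cert would map to a new resource.
print(before == after)
```

The trade-off is that every lookup has to read the credential files. Making private copies of the files at submit time, as also suggested above, would avoid the problem without hashing on each lookup.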