451797 – failure in one ec2 job can block other jobs

Summary: failure in one ec2 job can block other jobs

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	grid
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	1.1
Target Release:	---
Assignee:	Matthew Farrellee
QA Contact:	Kim van der Riet
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-06-17 13:40 UTC by Matthew Farrellee
Modified:	2008-11-04 15:00 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-11-04 15:00:51 UTC
Target Upstream Version:
Embargoed:

Description Matthew Farrellee 2008-06-17 13:40:55 UTC

When an EC2 job fails for some reason, such as improper authentication, the
entire EC2 resource is marked as down, preventing other jobs from being run.

Comment 1 Jaime Frey 2008-09-10 20:01:17 UTC

The EC2 resource is marked as down only if an AMAZON_VM_STATUS_ALL gahp command from the AmazonResource object in the gridmanager fails. There's a separate AmazonResource object for each set of key files used by jobs, so if a command fails, it should only affect other jobs using the same set of key files.

Comment 2 Matthew Farrellee 2008-09-11 04:10:27 UTC

The issue may be the credentials used to service the AMAZON_VM_STATUS_ALL command. If they are somehow invalid, the resource will be marked as down unnecessarily. Unfortunately, I do not remember the specifics of the issue. A test containing valid and invalid credentials submitted by a single user, or even a mix of users, may flush it out though. The invalid credentials should be tested as the first credentials available.

Comment 3 Matthew Farrellee 2008-10-15 20:59:32 UTC

Verified. This happens when a job is submitted with a bad cert and the grid resource is marked as down, then the cert is changed (the cert file's contents, not its name) and a new job is submitted using the same cert filename.

This is because the two jobs are considered to be using the same resource: they share the same cert filename, plus other similarities.

If the second job uses a different file name for the cert it will not be held up by the failure from the first job.
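The behaviour verified above can be illustrated with a small sketch. This is a hypothetical model and not Condor's gridmanager code: `ResourceTable`, `resource_key`, the example URL, and the file paths are all invented for illustration. The only assumption taken from the comments is that jobs are grouped into a resource by the names of their key files, not by the files' contents.

```python
# Hypothetical model of name-based resource grouping; not Condor source.
class ResourceTable:
    def __init__(self):
        # Keys of resources currently considered down.
        self._down = set()

    @staticmethod
    def resource_key(service_url, cert_path, key_path):
        # Assumption: only the file NAMES take part in the key,
        # so changing a file's contents does not change the key.
        return (service_url, cert_path, key_path)

    def mark_down(self, key):
        self._down.add(key)

    def can_run(self, key):
        return key not in self._down


table = ResourceTable()

# Job 1 is submitted with a bad cert; its status command fails,
# so the resource it maps to is marked as down.
job1 = ResourceTable.resource_key(
    "https://ec2.example", "/home/u/cert.pem", "/home/u/pk.pem")
table.mark_down(job1)

# Job 2 reuses the same cert filename (contents fixed in the meantime).
job2 = ResourceTable.resource_key(
    "https://ec2.example", "/home/u/cert.pem", "/home/u/pk.pem")

# Job 3 uses a different cert filename.
job3 = ResourceTable.resource_key(
    "https://ec2.example", "/home/u/cert2.pem", "/home/u/pk.pem")

print(table.can_run(job2))  # blocked: same key as the failed job
print(table.can_run(job3))  # not blocked: different key
```

In this model, job 2 is held up even though its cert is now valid, while job 3 is unaffected, which matches the observation in this comment.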

Comment 4 Matthew Farrellee 2008-11-04 15:00:51 UTC

This could also be addressed by fingerprinting the file in some way other than just its name. However, since this is an obscure issue that only occurs when modifying files that arguably shouldn't be modified during the execution of a job, I'm closing this as WONTFIX. It's also possible to argue that Condor should make its own copies of the files to avoid this issue more generally. This issue could be reopened if it becomes important enough.
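One way the fingerprinting idea could look is sketched below. It is an assumption-laden illustration, not a proposed Condor patch: `file_fingerprint` and `resource_key` are hypothetical helpers, SHA-256 is simply one possible choice of fingerprint, and the temporary files stand in for a user's cert and key files.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path):
    """Return a hex SHA-256 digest of the file's contents (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resource_key(cert_path, key_path):
    # Assumption: the resource is keyed by file CONTENTS rather than names.
    return (file_fingerprint(cert_path), file_fingerprint(key_path))


with tempfile.TemporaryDirectory() as tmp:
    cert = os.path.join(tmp, "cert.pem")
    key = os.path.join(tmp, "pk.pem")

    with open(cert, "w") as f:
        f.write("bad cert")
    with open(key, "w") as f:
        f.write("private key")
    key_before = resource_key(cert, key)

    # The cert's contents are replaced; the filename stays the same.
    with open(cert, "w") as f:
        f.write("good cert")
    key_after = resource_key(cert, key)

print(key_before != key_after)  # the two submissions map to different keys
```

With a content-based key, a job submitted after the cert file is corrected would no longer map to the resource that was marked as down, at the cost of hashing the files on each submission.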
